
1 Using Condor: An Introduction (Condor Week 2008)

2 The Condor Project (Established '85)
Distributed High Throughput Computing research performed by a team of ~35 faculty, full-time staff, and students.

3 The Condor Project (Established '85)
Distributed High Throughput Computing research performed by a team of ~35 faculty, full-time staff, and students who:
 face software engineering challenges in a distributed UNIX/Linux/NT environment
 are involved in national and international grid collaborations
 actively interact with academic and commercial users
 maintain and support large distributed production environments
 and educate and train students

4 Some free software produced by the Condor Project
› Condor System
› VDT
› Metronome
› ClassAd Library
› DAGMan
› Hawkeye
› GCB
› MW
› NeST / LotMan
› Stork
› And others… all as open source

5 Full featured system
› Flexible scheduling policy engine via ClassAds
   Preemption, suspension, requirements, preferences, groups, quotas, settable fair-share, system hold…
› Facilities to manage BOTH dedicated CPUs (clusters) and non-dedicated resources (desktops)
› Transparent checkpoint/migration for many types of serial jobs
› No shared file-system required
› Federate clusters with a wide array of Grid middleware
› Workflow management (inter-dependencies)
› Support for many job types: serial, parallel, etc.
› Fault-tolerant: can survive crashes and network outages; no single point of failure
› Development APIs: via SOAP / web services, DRMAA (C), Perl package, GAHP, flexible command-line tools, MW
› Platforms: Linux i386/IA64, Windows 2k/XP, MacOS, Solaris, IRIX, HP-UX, Compaq Tru64, … lots

6 Meet Frieda. She is a scientist with a big problem.

7 Frieda's Application
Run a parameter sweep of F(x,y,z) for 20 values of x, 10 values of y, and 3 values of z:
 20 × 10 × 3 = 600 combinations
 F takes on average 6 hours to compute on a "typical" workstation (total = 600 × 6 = 3600 hours)
 F requires a "moderate" (256 MB) amount of memory
 F performs "moderate" I/O: (x,y,z) is 5 MB and F(x,y,z) is 50 MB

8 I have 600 simulations to run. Where can I get help?

9 While sharing a beverage with some colleagues, she shares her problem. Somebody asks, "Have you tried Condor? It's free."

10 Getting Condor
› Available as a free download from http://www.cs.wisc.edu/condor
› Download Condor for your operating system
   Available for most UNIX platforms (including Linux and Apple's OS/X)
   Also for Windows XP / 2003 / Vista

11 Condor Releases
› Stable / Developer releases
   Version numbering scheme similar to that of the (pre-2.6) Linux kernels: major.minor.release
   Minor is even (a.b.c): Stable
    – Very stable, mostly bug fixes
    – Examples: 6.6.3, 6.8.4, 6.8.5
   Minor is odd (a.b.c): Developer
    – New features, may have some bugs
    – Examples: 6.7.11, 6.9.1, 6.9.2

12 Frieda installs a "Personal Condor" on her machine…
› What do we mean by a "Personal" Condor?
   Condor on your own workstation
   No root / administrator access required
   No system administrator intervention needed
› After installation, Frieda submits her jobs to her Personal Condor…

13 [Diagram: Frieda's workstation runs a personal Condor, with her 600 Condor jobs submitted to it]

14 Personal Condor?! What's the benefit of a Condor "Pool" with just one user and one machine?

15 Your Personal Condor will…
› Keep an eye on your jobs and keep you posted on their progress
› Implement your policy on the execution order of the jobs
› Keep a log of your job activities
› Add fault tolerance to your jobs
› Implement your policy on when the jobs can run on your workstation

16 Definitions
› Job
   The Condor representation of your work
› Machine
   The Condor representation of a computer that can perform the work
› Matchmaking
   Matching a job with a machine ("resource")

17 Job
Jobs state their requirements and preferences:
 I need a Linux/x86 platform
 I want the machine with the most memory
 I prefer a machine in the chemistry department

18 Machine
Machines state their requirements and preferences:
 Run jobs only when there is no keyboard activity
 I prefer to run Frieda's jobs
 I am a machine in the physics department
 Never run jobs belonging to Dr. Smith

19 The Magic of Matchmaking
› Jobs and machines state their requirements and preferences
› Condor matches jobs with machines based on requirements and preferences
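To make the matchmaking idea concrete, here is a rough sketch (not from the slides) of how such English statements can be written as ClassAd expressions; the attribute values, idle threshold, and owner name are illustrative assumptions:

# In a job's submit description file (values hypothetical):
Requirements = (Arch == "INTEL") && (OpSys == "LINUX")   # "I need a Linux/x86 platform"
Rank = Memory                                            # "I want the machine with the most memory"

# In a machine's Condor configuration (policy hypothetical):
START = (KeyboardIdle > 15 * 60)   # "Run jobs only when there is no keyboard activity"
RANK = (Owner == "frieda")         # "I prefer to run Frieda's jobs"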

20 Getting Started: Submitting Jobs to Condor
› Overview:
   Choose a "Universe" for your job
   Make your job "batch-ready"
   Create a submit description file
   Run condor_submit to put your job in the queue

21 1. Choose the "Universe"
› Controls how Condor handles jobs
› Choices include:
   Vanilla
   Standard
   Grid
   Java
   Parallel
   VM

22 Using the Vanilla Universe
The Vanilla Universe:
 – Allows running almost any "serial" job
 – Provides automatic file transfer, etc.
 – Like vanilla ice cream: can be used in just about any situation

23 2. Make your job batch-ready
Must be able to run in the background:
 No interactive input
 No GUI/window clicks
 No music ;^)

24 Make your job batch-ready (continued)
 The job can still use STDIN, STDOUT, and STDERR (the keyboard and the screen), but files are used for these instead of the actual devices
 Similar to UNIX shell redirection: $ ./myprogram < input.txt > output.txt

25 3. Create a Submit Description File
› A plain ASCII text file
› Condor does not care about file extensions
› Tells Condor about your job:
   Which executable, universe, input, output and error files to use; command-line arguments; environment variables; any special requirements or preferences (more on this later)
› Can describe many jobs at once (a "cluster"), each with different input, arguments, output, etc.

26 Simple Submit Description File

# Simple condor_submit input file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
# case sensitive, but filenames are!
Universe = vanilla
Executable = my_job
Output = output.txt
Queue

27 4. Run condor_submit
› You give condor_submit the name of the submit file you have created:
   condor_submit my_job.submit
› condor_submit:
   Parses the submit file, checks for errors
   Creates a "ClassAd" that describes your job(s)
   Puts the job(s) in the Job Queue

28 ClassAd?
› Condor's internal data representation
   Similar to classified ads (as the name implies)
   Represents an object & its attributes (usually many attributes)
   Can also describe what an object matches with

29 ClassAd Details
› ClassAds can contain a lot of details
   The job's executable is analysis.exe
   The machine's load average is 5.6
› ClassAds can specify requirements
   I require a machine with Linux
› ClassAds can specify preferences
   This machine prefers to run jobs from the physics group

30 ClassAd Details (continued)
› ClassAds are:
   semi-structured
   user-extensible
   schema-free
   Attribute = Expression

31 ClassAd Example

MyType = "Job"               · String
TargetType = "Machine"
ClusterId = 1377             · Number
Owner = "roy"
Cmd = "sim.exe"
Requirements =               · Boolean expression
  (Arch == "INTEL")
  && (OpSys == "LINUX")
  && (Disk >= DiskUsage)
  && ((Memory * 1024) >= ImageSize)
…

32 The Dog ClassAd

ClassAd for the "Dog":
Type = "Dog"
Color = "Brown"
Price = 12

ClassAd for the "Job":
...
Requirements = (Type == "Dog") &&
               (Color == "Brown") &&
               (Price <= 15)
...

33 The Job Queue
› condor_submit sends your job's ClassAd(s) to the schedd
› The schedd (more details later):
   Manages the local job queue
   Stores the job in the job queue
    Atomic operation, two-phase commit
    "Like money in the bank"
› View the queue with condor_q

34 Example condor_submit and condor_q

% condor_submit my_job.submit
Submitting job(s).
1 job(s) submitted to cluster 1.

% condor_q
-- Submitter: perdita.cs.wisc.edu : :
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0     frieda   6/16 06:52   0+00:00:00 I  0   0.0  my_job

1 jobs; 1 idle, 0 running, 0 held
%

35 Input, output & error files
› Controlled by submit file settings
› You can define the job's standard input, standard output and standard error:
   Read the job's standard input from "input_file":
    Input = input_file
    Shell equivalent: program < input_file
   Write the job's standard output to "output_file":
    Output = output_file
    Shell equivalent: program > output_file
   Write the job's standard error to "error_file":
    Error = error_file
    Shell equivalent: program 2> error_file

36 Email about your job
Condor sends email about job events to the submitting user.
Specify "notification" in your submit file to control which events:
 Notification = complete    (the default)
 Notification = never
 Notification = error
 Notification = always

37 Feedback on your job
› Create a log of job events
› Add to your submit description file: log = sim.log
› Becomes the life story of a job
   Shows all events in the life of a job
   Always have a log file

38 Sample Condor User Log

000 (0001.000.000) 05/25 19:10:03 Job submitted from host: ...
001 (0001.000.000) 05/25 19:12:17 Job executing on host: ...
005 (0001.000.000) 05/25 19:13:06 Job terminated.
        (1) Normal termination (return value 0) ...

39 Example Submit Description File With Logging

# Example condor_submit input file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
# case sensitive, but filenames are!
Universe = vanilla
Executable = /home/frieda/condor/my_job.condor
Log = my_job.log           · Job log (from Condor)
Input = my_job.in          · Program's standard input
Output = my_job.out        · Program's standard output
Error = my_job.err         · Program's standard error
Arguments = -a1 -a2        · Command line arguments
InitialDir = /home/frieda/condor/run
Queue

40 "Clusters" and "Processes"
› If your submit file describes multiple jobs, we call this a "cluster"
› Each cluster has a unique "cluster number"
› Each job in a cluster is called a "process"
   Process numbers always start at zero
› A Condor "Job ID" is the cluster number, a period, and the process number (e.g. 2.1)
   A cluster can have a single process
    Job ID = 20.0              · Cluster 20, process 0
   Or, a cluster can have more than one process
    Job IDs: 21.0, 21.1, 21.2  · Cluster 21, processes 0, 1, 2

41 Submit File for a Cluster

# Example submit file for a cluster of 2 jobs
# with separate input, output, error and log files
Universe = vanilla
Executable = my_job

Arguments = -x 0
log = my_job_0.log
Input = my_job_0.in
Output = my_job_0.out
Error = my_job_0.err
Queue                      · Job 2.0 (cluster 2, process 0)

Arguments = -x 1
log = my_job_1.log
Input = my_job_1.in
Output = my_job_1.out
Error = my_job_1.err
Queue                      · Job 2.1 (cluster 2, process 1)

42 Submitting the Job

% condor_submit my_job.submit-file
Submitting job(s).
2 job(s) submitted to cluster 2.

% condor_q
-- Submitter: perdita.cs.wisc.edu : :
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0     frieda   4/15 06:52   0+00:02:11 R  0   0.0  my_job -a1 -a2
 2.0     frieda   4/15 06:56   0+00:00:00 I  0   0.0  my_job -x 0
 2.1     frieda   4/15 06:56   0+00:00:00 I  0   0.0  my_job -x 1

3 jobs; 2 idle, 1 running, 0 held
%

43 Back to our 600 jobs…
› We could put all the input, output, error & log files in one directory
   One of each type for each job
   That'd be 2400 files (4 files × 600 jobs)
   Difficult to sort through
› Better: create a subdirectory for each run

44 Organize your files and directories for big runs
› Create subdirectories for each "run": run_0, run_1, … run_599
› Create input files in each of these:
   run_0/simulation.in
   run_1/simulation.in
   …
   run_599/simulation.in
› The output, error & log files for each job will be created by Condor from your job's output
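One way to create this layout from the shell, sketched here as an illustration (the template input file, and however per-run inputs are actually produced, are assumptions):

bash-2.05a$ for i in $(seq 0 599); do
>   mkdir run_$i
>   cp template.in run_$i/simulation.in   # template.in is hypothetical
> done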

45 Frieda's simulation directory

sim.exe
sim.sub
run_0/
   simulation.in
   simulation.out
   simulation.err
   simulation.log
…
run_599/
   simulation.in
   simulation.out
   simulation.err
   simulation.log

46 Submit Description File for 600 Jobs

# Cluster of 600 jobs with different directories
Universe = vanilla
Executable = sim
Log = simulation.log
...
Arguments = -x 0
InitialDir = run_0         · Log, input, output & error files -> run_0
Queue                      · Job 3.0 (Cluster 3, Process 0)

Arguments = -x 1
InitialDir = run_1         · Log, input, output & error files -> run_1
Queue                      · Job 3.1 (Cluster 3, Process 1)

· Do this 598 more times…

47 Submit File for a Big Cluster of Jobs
› We just submitted 1 cluster with 600 processes
› All the input/output files will be in different directories
› The submit file is pretty unwieldy (over 1200 lines)
› Isn't there a better way?

48 Submit File for a Big Cluster of Jobs (the better way) #1
› We can queue all 600 in 1 "Queue" command: Queue 600
› Condor provides $(Process) and $(Cluster)
   $(Process) will be expanded to the process number for each job in the cluster: 0, 1, … 599
   $(Cluster) will be expanded to the cluster number (4 for all jobs in this cluster)

49 Submit File for a Big Cluster of Jobs (the better way) #2
› The initial directory for each job can be specified using $(Process)
   InitialDir = run_$(Process)
   Condor will expand these to the "run_0", "run_1", … "run_599" directories
› Similarly, arguments can be variable
   Arguments = -x $(Process)
   Condor will expand these to "-x 0", "-x 1", … "-x 599"

50 Better Submit File for 600 Jobs

# Example condor_submit input file that defines
# a cluster of 600 jobs with different directories
Universe = vanilla
Executable = my_job
Log = my_job.log
Input = my_job.in
Output = my_job.out
Error = my_job.err
Arguments = -x $(Process)    · -x 0, -x 1, … -x 599
InitialDir = run_$(Process)  · run_0 … run_599
Queue 600                    · Jobs 4.0 … 4.599

51 Now, we submit it…

$ condor_submit my_job.submit
Submitting job(s) ................................................
Logging submit event(s) ..........................................
600 job(s) submitted to cluster 4.

52 And, check the queue

$ condor_q
-- Submitter: x.cs.wisc.edu : : x.cs.wisc.edu
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 4.0     frieda   4/20 12:08   0+00:00:05 R  0   9.8  my_job -arg1 -x 0
 4.1     frieda   4/20 12:08   0+00:00:03 I  0   9.8  my_job -arg1 -x 1
 4.2     frieda   4/20 12:08   0+00:00:01 I  0   9.8  my_job -arg1 -x 2
 4.3     frieda   4/20 12:08   0+00:00:00 I  0   9.8  my_job -arg1 -x 3
 ...
 4.598   frieda   4/20 12:08   0+00:00:00 I  0   9.8  my_job -arg1 -x 598
 4.599   frieda   4/20 12:08   0+00:00:00 I  0   9.8  my_job -arg1 -x 599

600 jobs; 599 idle, 1 running, 0 held

53 Removing jobs
› If you want to remove a job from the Condor queue, use condor_rm
› You can only remove jobs that you own
› A privileged user can remove any jobs
   "root" on UNIX
   "administrator" on Windows

54 Removing jobs (continued)
› Remove an entire cluster:
   condor_rm 4      · Removes the whole cluster
› Remove a specific job from a cluster:
   condor_rm 4.0    · Removes a single job
› Or, remove all of your jobs with "-a":
   condor_rm -a     · Removes all jobs / clusters

55 Good News
The Boss says Frieda can add her co-workers' desktop machines into her Condor pool as well… but only if they can also submit jobs. (Boss Fat Cat)

56 Adding nodes
› Frieda installs Condor on the desktop machines, and configures them with her machine as the central manager
   The central manager:
    Central repository for the whole pool
    Performs job / machine matching, etc.
› These are "non-dedicated" nodes, meaning that they can't always run Condor jobs
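In configuration terms, this amounts to pointing every machine at the same central manager. A minimal sketch (the hostname and exact daemon lists are assumptions; real setups vary):

# In condor_config on every machine in the pool:
CONDOR_HOST = frieda.cs.wisc.edu

# On the central manager (Frieda's machine):
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD

# On each co-worker's desktop (submit + execute, no matchmaking):
DAEMON_LIST = MASTER, SCHEDD, STARTD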

57 [Diagram: Frieda's workstation, with her personal Condor and 600 jobs, is now the central manager of a Condor pool of desktop machines]

58 condor_status

% condor_status
Name          OpSys  Arch  State     Activ LoadAv Mem  ActvtyTime
antipholus.cs LINUX  INTEL Unclaimed Idle  0.020  511  0+02:28:42
coral.cs.wisc LINUX  INTEL Claimed   Busy  0.990  511  0+01:27:21
doc.cs.wisc.e LINUX  INTEL Unclaimed Idle  0.260  511  0+00:20:04
dsonokwa.cs.w LINUX  INTEL Claimed   Busy  0.810  511  0+00:01:45
ferdinand.cs. LINUX  INTEL Claimed   Suspe 1.130  511  0+00:00:55
vm1@pinguino. LINUX  INTEL Unclaimed Idle  0.000  255  0+01:03:28
vm2@pinguino. LINUX  INTEL Unclaimed Idle  0.190  255  0+01:03:29

59 How can my jobs access their data files?

60 Access to Data in Condor
› Use a shared filesystem if available
› No shared filesystem?
   Condor can transfer files
    Can automatically send back changed files
    Atomic transfer of multiple files
    Can be encrypted over the wire
   Remote I/O Socket
   Standard Universe can use remote system calls (more on this later)

61 Condor File Transfer
› ShouldTransferFiles = YES
   Always transfer files to the execution site
› ShouldTransferFiles = NO
   Rely on a shared filesystem
› ShouldTransferFiles = IF_NEEDED
   Will automatically transfer the files if the submit and execute machines are not in the same FileSystemDomain

Universe = vanilla
Executable = my_job
Log = my_job.log
ShouldTransferFiles = IF_NEEDED
Transfer_input_files = dataset.$(Process), common.data
Transfer_output_files = TheAnswer.dat
Queue 600

62 We Always Want More
Condor is managing and running our jobs, but:
 Our CPU requirements are greater than our resources
 Jobs are preempted more often than we like

63 Happy Day! Frieda's organization purchased a Dedicated Cluster!
› Frieda installs Condor on all the dedicated cluster nodes
› Frieda also adds a dedicated central manager
› She configures her entire pool with this new host as the central manager…

64 Condor Pool
[Diagram: Frieda's Condor pool now spans the desktops and the dedicated cluster] With the additional resources, Frieda and her co-workers can get their jobs completed even faster.

65 Another Universe

66 Frieda submits some MPI jobs

universe = parallel
executable = mp1script
arguments = my_mpich_linked_executable arg1 arg2
machine_count = 4
should_transfer_files = yes
when_to_transfer_output = on_exit
transfer_input_files = my_mpich_linked_executable
queue

67 What Condor daemons are running on my machine, and what do they do?

68 condor_master
› Starts up all other Condor daemons
› If there are any problems and a daemon exits, it restarts the daemon and sends email to the administrator
› Acts as the server for many Condor remote administration commands:
   condor_reconfig, condor_restart, condor_off, condor_on, condor_config_val, etc.
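As a brief sketch of how those remote administration commands are typically invoked (the hostname node01 is hypothetical; consult your version's man pages for the full option set):

% condor_reconfig             # re-read configuration on the local host
% condor_off -startd node01   # stop the startd on node01
% condor_on node01            # start the Condor daemons on node01 again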

69 Condor Daemon Layout
[Diagram: on a Personal Condor / Central Manager, the master spawns the collector, negotiator, schedd, and startd]

70 Central Manager: condor_collector
› The central manager is the central repository and match maker for the whole pool
› Collects information from all other Condor daemons in the pool
   "Directory Service" / Database for a Condor pool
› Each daemon sends a periodic update called a "ClassAd" to the collector
› Services queries for information:
   Queries from other Condor daemons
   Queries from users (condor_status)
› Runs only on the Central Manager
› At least one collector per pool

71 Condor Pool Layout: Collector
[Diagram: the central manager's master spawns the collector; ClassAds flow to it from the rest of the pool]

72 Central Manager: condor_negotiator
› Performs "matchmaking" in Condor
› Each "negotiation cycle" (typically 5 minutes):
   Gets information from the collector about all available machines and all idle jobs
   Tries to match jobs with machines that will serve them
   Both the job and the machine must satisfy each other's requirements
› Only one negotiator per pool
› Runs only on the Central Manager

73 Condor Pool Layout: Negotiator
[Diagram: the central manager's master now spawns both the collector and the negotiator]

74 Execute Hosts: condor_startd
› Execute hosts: machines that run user jobs
› Represents a machine to the Condor system
› Responsible for starting, suspending, and stopping jobs
› Enforces the wishes of the machine owner (the owner's "policy"… more on this in the administrator's tutorial)
› Creates a "starter" for each running job
› One startd runs on each execute node

75 Condor Pool Layout: startd
[Diagram: each cluster node runs a master that spawns a startd; the startds send ClassAds to the central manager's collector]

76 Submit Hosts: condor_schedd
› Submit hosts: machines that users can submit jobs on
› Maintains the persistent queue of jobs
› Responsible for contacting available machines and sending them jobs
› Services user commands which manipulate the job queue:
   condor_submit, condor_rm, condor_q, condor_hold, condor_release, condor_prio, …
› Creates a "shadow" for each running job
› One schedd runs on each submit host

77 Condor Pool Layout: schedd
[Diagram: cluster nodes and desktops each run a master, a startd, and a schedd, all reporting to the central manager]

78 Condor Pool Layout: master
[Diagram: every machine in the pool, including the central manager, runs a condor_master that spawns its local daemons]

79 Now what?
› Some of the machines in the pool can't run my jobs
   Not enough RAM
   Not enough scratch disk space
   Required software not installed
   Etc.

80 Specify Requirements
› An expression (syntax similar to C or Java)
› Must evaluate to True for a match to be made

Universe = vanilla
Executable = my_job
Log = my_job.log
InitialDir = run_$(Process)
Requirements = Memory >= 256 && Disk > 10000
Queue 600

81 Advanced Requirements
› Requirements can match custom attributes in your Machine Ad
   Can be added by hand to each machine
   Or, automatically using the "Hawkeye" mechanism

Universe = vanilla
Executable = $$(MyProgPath)
Log = my_job.log
InitialDir = run_$(Process)
Requirements = Memory >= 256 && Disk > 10000 \
               && (MyProgPath =!= UNDEFINED)
Queue 600

82 And, Specify Rank
› All matches which meet the requirements can be sorted by preference with a Rank expression
› The higher the Rank, the better the match

Universe = vanilla
Executable = my_job
Log = my_job.log
Arguments = -arg1 -arg2
InitialDir = run_$(Process)
Requirements = Memory >= 256 && Disk > 10000
Rank = (KFLOPS * 10000) + Memory
Queue 600

83 My jobs aren't running!!

84 Check the queue
› Check the queue with condor_q:

bash-2.05a$ condor_q
-- Submitter: x.cs.wisc.edu : : x.cs.wisc.edu
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 5.0     frieda   4/20 12:23   0+00:00:00 I  0   9.8  my_job -arg1 -n 0
 5.1     frieda   4/20 12:23   0+00:00:00 I  0   9.8  my_job -arg1 -n 1
 5.2     frieda   4/20 12:23   0+00:00:00 I  0   9.8  my_job -arg1 -n 2
 5.3     frieda   4/20 12:23   0+00:00:00 I  0   9.8  my_job -arg1 -n 3
 5.4     frieda   4/20 12:23   0+00:00:00 I  0   9.8  my_job -arg1 -n 4
 5.5     frieda   4/20 12:23   0+00:00:00 I  0   9.8  my_job -arg1 -n 5
 5.6     frieda   4/20 12:23   0+00:00:00 I  0   9.8  my_job -arg1 -n 6
 5.7     frieda   4/20 12:23   0+00:00:00 I  0   9.8  my_job -arg1 -n 7
 6.0     frieda   4/20 13:22   0+00:00:00 H  0   9.8  my_job -arg1 -arg2

9 jobs; 8 idle, 0 running, 1 held

85 Look at jobs on hold

% condor_q -hold
-- Submitter: x.cs.wisc.edu : : x.cs.wisc.edu
 ID     OWNER    HELD_SINCE  HOLD_REASON
 6.0    frieda   4/20 13:23  Error from starter on skywalker.cs.wisc.edu

9 jobs; 8 idle, 0 running, 1 held

Or, see full details for a job:
% condor_q -l 6.0

86 Check machine status
› Verify that there are idle machines with condor_status:

bash-2.05a$ condor_status
Name        OpSys  Arch  State   Activity LoadAv Mem ActvtyTime
vm1@tonic.c LINUX  INTEL Claimed Busy     0.000  501 0+00:00:20
vm2@tonic.c LINUX  INTEL Claimed Busy     0.000  501 0+00:00:19
vm3@tonic.c LINUX  INTEL Claimed Busy     0.040  501 0+00:00:17
vm4@tonic.c LINUX  INTEL Claimed Busy     0.000  501 0+00:00:05

            Total Owner Claimed Unclaimed Matched Preempting
INTEL/LINUX     4     0       4         0       0          0
      Total     4     0       4         0       0          0

87 Look in the Job Log
› Look in your job log for clues:

bash-2.05a$ cat my_job.log
000 (031.000.000) 04/20 14:47:31 Job submitted from host: ...
007 (031.000.000) 04/20 15:02:00 Shadow exception!
    Error from starter on gig06.stat.wisc.edu:
    Failed to open '/scratch.1/frieda/workspace/v67/condor-test/test3/run_0/my_job.in'
    as standard input: No such file or directory (errno 2)
    0 - Run Bytes Sent By Job
    0 - Run Bytes Received By Job ...

88 Still not running? Exercise a little patience
› On a busy pool, it can take a while to match and start your jobs
› Wait at least a negotiation cycle or two (typically a few minutes)

89 Look to condor_q for help: condor_q -analyze

bash-2.05a$ condor_q -ana 29
---
029.000:  Run analysis summary.  Of 1243 machines,
   1243 are rejected by your job's requirements
      0 are available to run your job
   WARNING: Be advised:
     No resources matched request's constraints
     Check the Requirements expression below:

Requirements = ((Memory > 8192)) && (Arch == "INTEL") &&
   (OpSys == "LINUX") && (Disk >= DiskUsage) &&
   (TARGET.FileSystemDomain == MY.FileSystemDomain)

90 Better analysis (Linux only): condor_q -better-analyze

bash-2.05a$ condor_q -better-ana 29

The Requirements expression for your job is:
( ( target.Memory > 8192 ) ) && ( target.Arch == "INTEL" ) &&
( target.OpSys == "LINUX" ) && ( target.Disk >= DiskUsage ) &&
( TARGET.FileSystemDomain == MY.FileSystemDomain )

    Condition                                     Machines Matched  Suggestion
    ---------                                     ----------------  ----------
1   ( ( target.Memory > 8192 ) )                                 0  MODIFY TO 4000
2   ( TARGET.FileSystemDomain == "cs.wisc.edu" )               584
3   ( target.Arch == "INTEL" )                                1078
4   ( target.OpSys == "LINUX" )                               1100
5   ( target.Disk >= 13 )                                     1243

91 Learn about available resources

bash-2.05a$ condor_status -const 'Memory > 8192'
(no output means no matches)

bash-2.05a$ condor_status -const 'Memory > 4096'
Name          OpSys  Arch   State     Activ LoadAv Mem  ActvtyTime
vm1@s0-03.cs. LINUX  X86_64 Unclaimed Idle  0.000  5980  1+05:35:05
vm2@s0-03.cs. LINUX  X86_64 Unclaimed Idle  0.000  5980 13+05:37:03
vm1@s0-04.cs. LINUX  X86_64 Unclaimed Idle  0.000  7988  1+06:00:05
vm2@s0-04.cs. LINUX  X86_64 Unclaimed Idle  0.000  7988 13+06:03:47

             Total Owner Claimed Unclaimed Matched Preempting
X86_64/LINUX     4     0       0         4       0          0
       Total     4     0       0         4       0          0

92 Job Policy Expressions
› The user can supply job policy expressions in the submit file
› Can be used to describe a successful run:
   on_exit_remove = <expression>
   on_exit_hold = <expression>
   periodic_remove = <expression>
   periodic_hold = <expression>

93 Job Policy Examples
› Do not remove if it exits with a signal:
   on_exit_remove = ExitBySignal == False
› Place on hold if it exits with nonzero status or ran for less than an hour:
   on_exit_hold = ((ExitBySignal == False) && (ExitCode != 0)) || ((ServerStartTime - JobStartDate) < 3600)
› Place on hold if the job has spent more than 50% of its time suspended:
   periodic_hold = CumulativeSuspensionTime > (RemoteWallClockTime / 2.0)

94 Insert ClassAd attributes
› Special purpose usage
› In the submit description file, introduce an attribute for the job:
   +Department = "biochemistry"
   causes the job's ClassAd to contain:
   Department = "biochemistry"
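Custom attributes like this become useful at match time. As an illustrative (hypothetical) pairing, a machine owner could prefer jobs carrying the attribute:

# Job submit file:
+Department = "biochemistry"

# Machine configuration (policy hypothetical):
RANK = (TARGET.Department =?= "biochemistry")   # =?= stays defined even when
                                                # the job has no Department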

95 We've seen how Condor can:
 Keep an eye on your jobs and keep you posted on their progress
 Implement your policy on the execution order of the jobs
 Keep a log of your job activities

96 Another Universe

97 My new jobs run for 20 days…
› What happens when a job is forced off its CPU?
   Preempted by a higher-priority user or job
   Vacated because of user activity
› How can I add fault tolerance to my jobs?

98 Condor's Standard Universe to the rescue!
› Support for transparent process checkpoint and restart
› Remote system calls (remote I/O)
   Your job can read / write files as if they were local

99 Remote System Calls in the Standard Universe
› I/O system calls are trapped and sent back to the submit machine
   Examples: open a file, write to a file
› No source code changes typically required
› Programming language independent

100 Process Checkpointing in the Standard Universe
› Condor's process checkpointing provides a mechanism to automatically save the state of a job
› The process can then be restarted from right where it was checkpointed
   After preemption, crash, etc.

101 Checkpointing: Process Starts
A checkpoint is the entire state of a program, saved in a file: CPU registers, memory image, I/O.
[Timeline diagram: a process begins running along a time axis]

102 Checkpointing: Process Checkpointed
[Timeline diagram: checkpoints 1, 2, and 3 are written as the process runs]

103 Checkpointing: Process Killed
[Timeline diagram: the process is killed after checkpoint 3; checkpoint 3 survives]

104 Checkpointing: Process Resumed
[Timeline diagram: the process resumes from checkpoint 3; work preserved by checkpoints is "goodput", while work lost since the last checkpoint is "badput"]

105 When will Condor checkpoint your job?
› Periodically, if desired
   For fault tolerance
› When your job is preempted by a higher-priority job
› When your job is vacated because the execution machine becomes busy
› When you explicitly run the condor_checkpoint, condor_vacate, condor_off, or condor_restart command
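For the explicit case, a brief sketch of typical usage (a plain invocation acts on the local machine; check your version's man pages before relying on this):

% condor_checkpoint    # ask jobs running on this machine to write a checkpoint
% condor_vacate        # checkpoint jobs here, then evict them from the machine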

106 Making the Standard Universe Work
› The job must be relinked with Condor's standard universe support library
› To relink, place condor_compile in front of the command used to link the job:

% condor_compile gcc -o myjob myjob.c
   - OR -
% condor_compile f77 -o myjob filea.f fileb.f
   - OR -
% condor_compile make -f MyMakefile

107 Limitations of the Standard Universe
› Condor's checkpointing is not at the kernel level. A Standard Universe job may not:
   fork()
   Use kernel threads
   Use some forms of IPC, such as pipes and shared memory
› You must have access to the source code to relink
› Many typical scientific jobs are OK

108 Connecting Condors
› Frieda knows people with their own Condor pools, and gets permission to use their computing resources…
› How can Condor help her do this?

109 Connect Condors with Flocking
› Frieda configures her Condor pool to "flock" to her friend's pool
› Flocking is a Condor-specific technology
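Under the hood, flocking is configured on both sides. A minimal sketch with hypothetical hostnames (real deployments also need matching security settings):

# In Frieda's pool (submit side):
FLOCK_TO = condor.friendly-pool.edu

# In the friend's pool (execute side), allow Frieda's schedd in:
FLOCK_FROM = frieda.cs.wisc.edu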

110 [Diagram: Frieda's Condor pool, carrying her 600 Condor jobs, flocks to the friendly Condor pool]

111 Frieda meets The Grid
› Frieda also has access to grid resources she wants to use
   She has certificates and access to Globus or other resources at remote institutions
› But Frieda wants Condor's queue management features for her jobs!
› She installs Condor so she can submit "Grid Universe" jobs to Condor

112 "Grid" Universe
› All handled in your submit file
› Supports a number of "back end" types:
   Globus: GT2, GT4
   NorduGrid
   UNICORE
   Condor
   PBS
   LSF
   EC2
   NQS

113 Grid Universe & Globus 2
› Used for a Globus GT2 back-end
   "Condor-G"
› Format:
   Grid_Resource = gt2 <Head-Node>
   Globus_rsl = <RSL-String>
› Example:
   Universe = grid
   Grid_Resource = gt2 beak.cs.wisc.edu/jobmanager
   Globus_rsl = (queue=long)(project=atom-smasher)

114 Grid Universe & Globus 4
› Used for a Globus GT4 back-end
› Format:
   Grid_Resource = gt4 <Head-Node> <Scheduler-Type>
   Globus_XML = <XML-String>
› Example:
   Universe = grid
   Grid_Resource = gt4 beak.cs.wisc.edu Condor
   Globus_xml = <queue>long</queue><project>atom-smasher</project>

115 Grid Universe & Condor
› Used for a Condor back-end
   "Condor-C"
› Format:
   Grid_Resource = condor <Schedd-Name> <Collector-Name>
   Remote_<attribute> = <value>
    The "Remote_" part is stripped off
› Example:
   Universe = grid
   Grid_Resource = condor beak condor.cs.wisc.edu
   Remote_Universe = standard

116 Grid Universe & NorduGrid
› Used for a NorduGrid back-end
› Format:
   Grid_Resource = nordugrid <Host-Name>
› Example:
   Universe = grid
   Grid_Resource = nordugrid ngrid.cs.wisc.edu

117 Grid Universe & UNICORE
› Used for a UNICORE back-end
› Format:
   Grid_Resource = unicore <USite> <VSite>
› Example:
   Universe = grid
   Grid_Resource = unicore uhost.cs.wisc.edu vhost

118 Grid Universe & PBS
› Used for a PBS back-end
› Format:
   Grid_Resource = pbs
› Example:
   Universe = grid
   Grid_Resource = pbs

119 Grid Universe & LSF
› Used for an LSF back-end
› Format:
   Grid_Resource = lsf
› Example:
   Universe = grid
   Grid_Resource = lsf

120 Credential Management
› Condor will do The Right Thing™ with your X509 certificate and proxy
› Override the default proxy:
   X509UserProxy = /home/frieda/other/proxy
› Your proxy may expire before your jobs finish executing
   Condor can use MyProxy to renew your proxy
   When a new proxy is available, Condor will forward the renewed proxy to the job
   This works for non-grid jobs, too
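A sketch of what the MyProxy-related submit commands can look like; the server name and lifetime are assumptions, and the exact knobs vary by Condor version, so treat this as illustrative rather than authoritative:

X509UserProxy = /home/frieda/other/proxy
MyProxyHost = myproxy.cs.wisc.edu:7512    # hypothetical MyProxy server
MyProxyNewProxyLifetime = 720             # renewed proxy lifetime, in minutes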

121 My jobs have dependencies… Can Condor help solve my dependency problems?

122 Condor Universes: Scheduler and Local
› Scheduler Universe
   Plug in a meta-scheduler
   Developed for DAGMan (more later)
   Similar to Globus's fork job manager
› Local Universe
   Very similar to vanilla, but jobs run on the local host
   Has more control over jobs than the scheduler universe

123 Frieda learns DAGMan
› Directed Acyclic Graph Manager
› DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you
› e.g., "Don't run job B until job A has completed successfully."

124 What is a DAG?
› A DAG is the data structure used by DAGMan to represent these dependencies
› Each job is a "node" in the DAG
› Each node can have any number of "parent" or "child" nodes, as long as there are no loops!
[Diagram: Job A is the parent of Jobs B and C, which in turn are both parents of Job D]

125 Defining a DAG
› A DAG is defined by a .dag file, listing each of its nodes and their dependencies:

# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D

› Each node will run the Condor job specified by its accompanying Condor submit file

126 Submitting a DAG
› To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon to begin running your jobs:
   % condor_submit_dag diamond.dag
› The DAGMan daemon is run by the schedd
   DAGMan itself is "watched" by Condor, so you don't have to

127 Running a DAG
› DAGMan acts as a "meta-scheduler", managing the submission of your jobs to Condor based on the DAG dependencies
[Diagram: DAGMan reads the .dag file, holds B, C, and D, and submits job A to the Condor job queue]

128 Running a DAG (cont'd)
› DAGMan holds & submits jobs to the Condor queue at the appropriate times
[Diagram: once A completes, DAGMan submits B and C to the queue while still holding D]

129 Running a DAG (cont'd)
› In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a "rescue" file with the current state of the DAG
[Diagram: one node fails, so DAGMan records the DAG's current state in a rescue file]

130 Recovering a DAG
› Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG
[Diagram: DAGMan reads the rescue file and re-submits the failed node to the queue]

131 Recovering a DAG (cont'd)
› Once that job completes, DAGMan will continue the DAG as if the failure never happened
[Diagram: the DAG proceeds and job D is submitted to the queue]

132 Finishing a DAG
› Once the DAG is complete, the DAGMan job itself is finished, and exits
[Diagram: all nodes are done and the Condor job queue is empty]

133 Additional DAGMan Features
› Provides other handy features for job management (sketched below):
   Nodes can have PRE & POST scripts
   Failed nodes can be automatically re-tried a configurable number of times
   Job submission can be "throttled"
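A sketch of these features in DAG-file syntax; the script names are hypothetical:

# In the DAG file:
SCRIPT PRE  B setup.sh      # run setup.sh before node B is submitted
SCRIPT POST B cleanup.sh    # run cleanup.sh after node B completes
RETRY B 3                   # re-try node B up to 3 times if it fails

# Throttling, at submit time:
% condor_submit_dag -maxjobs 50 diamond.dag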

134 General User Commands
› condor_status        View pool status
› condor_q             View job queue
› condor_submit        Submit new jobs
› condor_rm            Remove jobs
› condor_prio          Intra-user priorities
› condor_history       Completed job info
› condor_submit_dag    Submit new DAG
› condor_checkpoint    Force a checkpoint
› condor_compile       Link the Condor library

135 Condor Job Universes
› Vanilla Universe
› Standard Universe
› Grid Universe
› Scheduler Universe
› Local Universe
› Virtual Machine Universe
› Java Universe
› Parallel Universe
   MPICH-1
   MPICH-2
   LAM
   …

136 Why have a special Universe for Java jobs?
› The Java Universe provides more than just inserting "java" at the start of the execute line of a vanilla job:
   Knows which machines have a JVM installed
   Knows the location, version, and performance of the JVM on each machine
   Knows about jar files, etc.
   Provides more information about Java job completion than just the JVM exit code
    The program runs in a Java wrapper, allowing Condor to report Java exceptions, etc.

137 Java Universe Job
› Example Java Universe submit file:

Universe = java
Executable = Main.class
jar_files = MyLibrary.jar
Input = infile
Output = outfile
Arguments = Main 1 2 3
Queue

138 Java support, cont.

bash-2.05a$ condor_status -java
Name        JavaVendor  Ver    State     Actv LoadAv Mem
abulafia.cs Sun Microsy 1.5.0_ Claimed   Busy 0.180  503
acme.cs.wis Sun Microsy 1.5.0_ Unclaimed Idle 0.000  503
adelie01.cs Sun Microsy 1.5.0_ Claimed   Busy 0.000 1002
adelie02.cs Sun Microsy 1.5.0_ Claimed   Busy 0.000 1002
…
                Total Owner Claimed Unclaimed Matched Preempting
    INTEL/LINUX   965   179     516       250      20          0
  INTEL/WINNT50   102     6      65        31       0          0
SUN4u/SOLARIS28     1     0       0         1       0          0
   X86_64/LINUX   128     2     106        20       0          0
          Total  1196   187     687       302      20          0

139 Frieda wants Condor features on remote resources
› She wants to run standard universe jobs on Grid-managed resources
   For matchmaking and dynamic scheduling of jobs
   For job checkpointing and migration
   For remote system calls

140 Condor GlideIn
› Frieda can use the Grid Universe to run Condor daemons on Grid resources
› When the resources run these GlideIn jobs, they will temporarily join her Condor Pool
› She can then submit Standard, Vanilla, PVM, or MPI Universe jobs, and they will be matched and run on the remote resources
› Currently only supports Globus GT2
   We hope to fix this limitation

141 [Diagram: Frieda's workstation and personal Condor submit glide-in jobs through the Globus Grid to PBS-, LSF-, and Condor-managed sites; the glided-in resources join her Condor pool alongside the friendly pool, running her 600 Condor jobs]

142 How It Works
[Diagram: on the personal Condor, the master, schedd, and collector & negotiator manage the Condor jobs; a grid manager submits GlideIn jobs to the remote resource's LSF manager, which starts a Condor startd there; once the startd reports back, a shadow on the submit side and a starter on the remote resource run the user job]

143 GlideIn Concerns
› What if the remote resource kills my GlideIn job?
   That resource will disappear from your pool and your jobs will be rescheduled on other machines
   Standard universe jobs will resume from their last checkpoint like usual
› What if all my jobs are completed before a GlideIn job runs?
   If a GlideIn Condor daemon is not matched with a job in 10 minutes, it terminates, freeing the resource

144 In Review
With Condor's help, Frieda can:
 Manage her compute job workload
 Access local machines
 Access remote Condor Pools via flocking
 Access remote compute resources on the Grid via "Grid Universe" jobs
 Carve out her own personal Condor Pool from the Grid with GlideIn technology

145 Administrator Commands
› condor_vacate        Leave a machine now
› condor_on            Start Condor
› condor_off           Stop Condor
› condor_reconfig      Reconfig on-the-fly
› condor_config_val    View/set config
› condor_userprio      User priorities
› condor_stats         View detailed usage accounting stats

146 My boss wants to watch what Condor is doing

147 Use CondorView!
› Provides visual graphs of current and past utilization
› Data is derived from Condor's own accounting statistics
› Interactive Java applet
› Quickly and easily view:
   How much Condor is being used
   How many cycles are being delivered
   Who is using them
   Utilization by machine platform or by user

148 CondorView Usage Graph
[Image: sample CondorView utilization graph]

149 I could also talk lots about…
› GCB: living with firewalls & private networks
› Federated Grids/Clusters
› APIs and Portals
› MW
› Database support (Quill)
› High Availability fail-over
› Compute On-Demand (COD)
› Dynamic pool creation ("Glide-in")
› Role-based prioritization and accounting
› Strong security, including privilege separation
› Data movement scheduling in workflows
› …

150 Thank you!
Check us out on the Web: http://www.condorproject.org
Email: condor-admin@cs.wisc.edu

