More HTCondor
Monday AM, Lecture 2
Ian Ross
Center for High Throughput Computing, University of Wisconsin-Madison

Questions so far?

Goals for this Session
- Understand the mechanisms of HTCondor (and HTC in general) a bit more deeply
- Use a few more HTCondor features
- Run more (and more complex) jobs at once

Why is HTC Difficult?
- The system must track jobs, machines, policy, ...
- The system must recover gracefully from failures
- It should try to use all available resources, all the time
- There is lots of variety in users, machines, networks, ...
- Sharing is hard (e.g., policy, security)

Main Parts of HTCondor

Function                        HTCondor Name         #
Track waiting/running jobs      schedd ("sked-dee")   1+
Track available machines        collector             1
Match jobs and machines         negotiator            1
Manage one machine              startd ("start-dee")  per machine
Manage one job (on submitter)   shadow                per running job
Manage one job (on machine)     starter               per running job

HTCondor Matchmaking
- HTCondor's central manager (negotiator + collector) matches jobs with computers
- Information about each job and computer is kept in a "ClassAd"
- ClassAds have the format: AttributeName = value
- A value can be a boolean, number, or string

[Diagram: submit and execute machines connected to the central manager (negotiator + collector)]

Job ClassAd = submit file + default HTCondor configuration

Submit file:
executable = compare_states
arguments = wi.dat us.dat wi.dat.out
should_transfer_files = YES
transfer_input_files = us.dat, wi.dat
when_to_transfer_output = ON_EXIT
log = job.log
output = job.out
error = job.err
request_cpus = 1
request_disk = 20MB
request_memory = 20MB
queue 1

Resulting job ClassAd (excerpt):
RequestCpus = 1
Err = "job.err"
WhenToTransferOutput = "ON_EXIT"
TargetType = "Machine"
Cmd = "/home/alice/tests/htcondor_week/compare_states"
JobUniverse = 5
Iwd = "/home/alice/tests/htcondor_week"
NumJobStarts = 0
WantRemoteIO = true
OnExitRemove = true
TransferInput = "us.dat,wi.dat"
MyType = "Job"
Out = "job.out"
UserLog = "/home/alice/tests/htcondor_week/job.log"
RequestMemory = 20
...

Machine ClassAd = machine facts + default HTCondor configuration

Machine ClassAd (excerpt):
HasFileTransfer = true
DynamicSlot = true
TotalSlotDisk = 4300218.0
TargetType = "Job"
TotalSlotMemory = 2048
Mips = 17902
Memory = 2048
UtsnameSysname = "Linux"
MAX_PREEMPT = ( 3600 * ( 72 - 68 * ( WantGlidein =?= true ) ) )
Requirements = ( START ) && ( IsValidCheckpointPlatform ) && ( WithinResourceLimits )
OpSysMajorVer = 6
TotalMemory = 9889
HasGluster = true
OpSysName = "SL"
HasDocker = true
...

HTCondor Matchmaking
On a regular basis, the central manager reviews job and machine ClassAds and matches jobs to computers.

[Diagram: submit and execute machines connected to the central manager (negotiator + collector)]

Job submission, revisited

Waiting for Matchmaking
Back to our compare_states example:

condor_submit job.submit   # job is submitted to the queue!
condor_q                   # let's check the status!

-- Schedd: learn.chtc.wisc.edu : <...> : ...
 ID      OWNER  SUBMITTED    RUN_TIME    ST PRI SIZE CMD
 8.0     cat    11/12 09:30  0+00:00:00  I  0   0.0  compare_states
1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended

Matchmaking does not happen instantaneously! We need to wait for the periodic matchmaking cycle (on the order of 5 minutes).

Job Idle (ST = I)

Submit node (submit_dir)/: job.submit, compare_states, wi.dat, us.dat, job.log, job.out, job.err

-- Schedd: learn.chtc.wisc.edu : <...> : ...
 ID      OWNER  SUBMITTED    RUN_TIME    ST PRI SIZE CMD
 8.0     cat    11/12 09:30  0+00:00:00  I  0   0.0  compare_states
1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended

Match Made! Job Starts (ST = <, transferring input)

-- Schedd: learn.chtc.wisc.edu : <...> : ...
 ID      OWNER  SUBMITTED    RUN_TIME    ST PRI SIZE CMD
 8.0     cat    11/12 09:30  0+00:00:00  <  0   0.0  compare_states
1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended

Submit node (submit_dir)/: job.submit, compare_states, wi.dat, us.dat, job.log, job.out, job.err
Execute node (execute_dir)/: receives the transferred files compare_states, wi.dat, us.dat

Job Running (ST = R)

-- Schedd: learn.chtc.wisc.edu : <...> : ...
 ID      OWNER  SUBMITTED    RUN_TIME    ST PRI SIZE CMD
 8.0     cat    11/12 09:30  0+00:00:00  R  0   0.0  compare_states
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

Submit node (submit_dir)/: job.submit, compare_states, wi.dat, us.dat, job.log, job.out, job.err
Execute node (execute_dir)/: compare_states, wi.dat, us.dat, stderr, stdout, wi.dat.out

Job Completes (ST = >, transferring output)

-- Schedd: learn.chtc.wisc.edu : <...> : ...
 ID      OWNER  SUBMITTED    RUN_TIME    ST PRI SIZE CMD
 8.0     cat    11/12 09:30  0+00:00:00  >  0   0.0  compare_states
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

Execute node (execute_dir)/: compare_states, wi.dat, us.dat, stderr, stdout, wi.dat.out
stderr, stdout, and wi.dat.out are transferred back to the submit node.

Job Completes (continued)

-- Schedd: learn.chtc.wisc.edu : <...> : ...
 ID      OWNER  SUBMITTED    RUN_TIME    ST PRI SIZE CMD
0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended

Submit node (submit_dir)/: job.submit, compare_states, wi.dat, us.dat, job.log, job.out, job.err, wi.dat.out

HTCondor Priorities

Job priority:
- Set per job by the user (owner)
- Relative to that user's other jobs
- Set in the submit file or changed later with condor_prio (see the example below)
- Higher number means run sooner

User priority:
- Computed based on past usage
- Determines the user's "fair share" percentage of slots
- Lower number means run sooner (0.5 is the minimum)

Preemption:
- Low-priority jobs are stopped for high-priority ones (stopped jobs go back into the regular queue)
- Governed by the fair-share algorithm and pool policy
- Not enabled on all pools
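For instance, a small sketch of raising a job's priority with condor_prio (using job 8.0 from the earlier example; the value 10 is arbitrary):

condor_prio -p 10 8.0   # raise job 8.0's priority relative to your other jobs

Remember that this only reorders your own idle jobs; it does not move you ahead of other users.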

Submit Files

File Access in HTCondor

Option 1: Shared filesystem
- Easy to use (jobs just access files)
- But it must exist and be ready to handle the load

should_transfer_files = NO

Option 2: HTCondor transfers files for you
- Must name all input files (except the executable)
- May name output files; defaults to all new/changed files

should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = a.txt, b.tgz

Resource Requests

request_cpus = ClassAdExpression
request_disk = ClassAdExpression
request_memory = ClassAdExpression

Be a good user! Ask for the minimum resources your job needs on the execute machine, and check the job log for actual usage!!! (Requests may be dynamically allocated; very advanced!)

request_disk = 2000000   # in KB by default
request_disk = 2GB       # KB, MB, GB, TB
request_memory = 2000    # in MB by default
request_memory = 2GB     # KB, MB, GB, TB

Resource Requests: Log File

000 (128.000.000) 05/09 11:09:08 Job submitted from host: <128.104.101.92&sock=6423_b881_3>
...
001 (128.000.000) 05/09 11:10:46 Job executing on host: <128.104.101.128:9618&sock=5053_3126_3>
006 (128.000.000) 05/09 11:10:54 Image size of job updated: 220
    1 - MemoryUsage of job (MB)
    220 - ResidentSetSize of job (KB)
005 (128.000.000) 05/09 11:12:48 Job terminated.
    (1) Normal termination (return value 0)
    Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
    Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
    Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
    Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
    0 - Run Bytes Sent By Job
    33 - Run Bytes Received By Job
    0 - Total Bytes Sent By Job
    33 - Total Bytes Received By Job
    Partitionable Resources :   Usage  Request  Allocated
       Cpus                 :              1          1
       Disk (KB)            :  612015   20480   17203728
       Memory (MB)          :     312    3000       3000

Resource Requests: Log File (continued)

This user needs to update their resource requests (see the log excerpt above)!
- Disk: the job used more disk space (612,015 KB) than it requested (20,480 KB), which can cause jobs to go on hold. More on this tomorrow!
- Memory: the job requested far more memory (3,000 MB) than it used (312 MB). Such jobs can take longer to match (there may be old 1 GB RAM machines sitting around, ready for them!), and other users might actually need all that memory and are waiting in line.

Email Notifications

notification = Always | Complete | Error | Never

When to send email:
- Always: job checkpoints or completes
- Complete: job completes (default)
- Error: job completes with an error
- Never: do not send email

Where to send email:

notify_user = email

Defaults to user@submit-machine.
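Putting the two commands together, a minimal sketch (the address alice@wisc.edu is a made-up example):

notification = Error
notify_user = alice@wisc.edu

With these lines in the submit file, email goes to alice@wisc.edu only when a job completes with an error.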

Requirements and Rank

requirements = ClassAdExpression
- The expression must evaluate to true for the job to match a slot
- HTCondor adds defaults! See the HTCondor Manual (esp. sections 2.5.2 and 4.1) for more

rank = ClassAdExpression
- Ranks matching slots in order of preference
- Must evaluate to a floating-point number; higher is better
- False becomes 0.0, True becomes 1.0
- Undefined or error values become 0.0
- Writing rank expressions is an art (see the sketch below)
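As an illustration, a hedged sketch using attribute names from the machine ClassAd shown earlier (your pool's attributes may differ):

requirements = (OpSysMajorVer >= 6) && (Memory >= 2048)
rank = Mips

This job only matches slots with OS major version 6 or newer and at least 2 GB of memory, and among acceptable slots it prefers those with the highest Mips rating.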

Arbitrary Attributes

+AttributeName = value

Adds arbitrary attribute(s) to a job's ClassAd. Useful in (at least) 2 cases:
- Affect matchmaking with special attributes
- Report on jobs with a specific attribute value

Experiment with reporting during the exercises!
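For example, a sketch with a made-up attribute name (IsTestJob is purely illustrative):

+IsTestJob = true

in the submit file adds IsTestJob to the job ClassAd, after which you can report on just those jobs:

condor_q -constraint 'IsTestJob == true'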

Submitting multiple jobs

Many Jobs, One Submit File

HTCondor offers many ways to submit multiple jobs from one submit file, allowing you to:
- Analyze multiple data files
- Test many parameter or input combinations
- Modify arguments

...without having to:
- Start each job individually
- Create a separate submit file for each job

Multiple Numbered Input Files

Goal: create 3 jobs that each analyze a different input file.

(submit_dir)/: job.submit, analyze.exe, file0.in, file1.in, file2.in

job.submit:
executable = analyze.exe
arguments = file.in file.out
transfer_input_files = file.in
log = job.log
output = job.out
error = job.err
queue

Multiple Numbered Input Files (continued)

job.submit:
executable = analyze.exe
arguments = file.in file.out
transfer_input_files = file.in
log = job.log
output = job.out
error = job.err
queue 3

This generates 3 jobs, but it doesn't change the inputs, and the jobs overwrite each other's outputs. So how can we specify different values for each job?

Manual Approach (Not Recommended!)

job.submit:
executable = analyze.exe
log = job.log

arguments = file0.in file0.out
transfer_input_files = file0.in
output = job0.out
error = job0.err
queue 1

arguments = file1.in file1.out
transfer_input_files = file1.in
output = job1.out
error = job1.err
queue 1
...

(submit_dir)/: job.submit, analyze.exe, file0.in, file1.in, file2.in

Automatic Variables

queue N

ClusterId  ProcId
128        0
128        1
128        2
...        ...
128        N-1

Each job's ClusterId and ProcId numbers are saved as job attributes. They can be accessed inside the submit file using $(Cluster) and $(Proc)*.

* $(ClusterId) and $(ProcId) are also used.

Multiple Numbered Input Files with Automatic Variables

job.submit:
executable = analyze.exe
arguments = file$(Process).in file$(Process).out
transfer_input_files = file$(Process).in
log = job_$(Cluster).log
output = job_$(Process).out
error = job_$(Process).err
queue 3

(submit_dir)/: job.submit, analyze.exe, file0.in, file1.in, file2.in

$(Cluster) and $(Process) allow us to provide unique values to jobs!

Separating Jobs with initialdir

initialdir changes the submission directory for each job, allowing each job to "live" in a separate directory:
- Uses the same name for all input/output files
- Useful for jobs with lots of output files

[Diagram: directories job0, job1, job2, job3, job4]

Separating Jobs with initialdir (continued)

(submit_dir)/:
  job.submit
  analyze.exe
  job0/: file.in, job.log, job.err, file.out
  job1/: file.in, job.log, job.err, file.out
  job2/: file.in, job.log, job.err, file.out

job.submit:
executable = analyze.exe
initialdir = job$(ProcId)
arguments = file.in file.out
transfer_input_files = file.in
log = job.log
error = job.err
queue 3

The executable should be in the directory with the submit file, *not* in the individual job directories.

Many Jobs per Submit File

Back to our compare_states example: what if we had data for each state? We could write 50 submit files (or use 50 "queue 1" statements):

executable = compare_states
arguments = wi.dat us.dat wi.dat.out
...

executable = compare_states
arguments = ca.dat us.dat ca.dat.out
...

executable = compare_states
arguments = mo.dat us.dat mo.dat.out
...

(and so on for all 50 states: md, wv, fl, wa, co, mi, nv, sd, mn, vt, al, tx, ut, ak, tn, ...)

Or we could organize our data to fit the $(Process) or initialdir approaches... or we could use HTCondor's queue language to submit jobs smartly!

Submitting Multiple Jobs

Replace job-specific files:
executable = compare_states
arguments = wi.dat us.dat wi.dat.out
...
transfer_input_files = us.dat, wi.dat

...with a variable:
executable = compare_states
arguments = $(state).dat us.dat $(state).dat.out
...
transfer_input_files = us.dat, $(state).dat
queue ???

But how do we define these variables in our queue statement?

Submitting Multiple Jobs: Queue Statements

multiple "queue" statements:
state = wi.dat
queue 1
state = ca.dat
queue 1
state = mo.dat
queue 1

matching ... pattern:
queue state matching *.dat

in ... list:
queue state in (wi.dat ca.dat co.dat)

from ... file:
queue state from state_list.txt

state_list.txt:
wi.dat
ca.dat
mo.dat

Submitting Multiple Jobs: Queue Statements (comparison)

- multiple queue statements: Not recommended. Can be useful when submitting job batches where a single (non-file/argument) characteristic is changing.
- matching ... pattern: Natural nested looping with minimal programming; use the optional "files" and "dirs" keywords to match only files or only directories. Requires good naming conventions.
- in ... list: Supports multiple variables; all information is contained in a single file; reproducible. Harder to automate submit file creation.
- from ... file: Supports multiple variables; highly modular (easy to use one submit file for many job batches); reproducible. An additional file is needed.

Using Multiple Variables

Both the "from" and "in" syntaxes support multiple variables from a list.

job.submit:
executable = compare_states
arguments = -y $(option) -i $(file)
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = $(file)
queue file,option from job_list.txt

job_list.txt:
wi.dat, 2010
wi.dat, 2015
ca.dat, 2010
ca.dat, 2015
mo.dat, 2010
mo.dat, 2015

Submitting Multiple Jobs: Advanced Variables

- $(Step), $(Item), $(Row), and $(ItemIndex) provide additional handles when queuing multiple jobs
- Function macros: e.g. $Fn(state) becomes "wi" when state is "wi.dat"
- Python-style slicing: queue state matching [:1] *.dat submits only one job (great for testing!)
- There are lots of additional (and powerful!) features; experiment if you finish the exercises early! See the documentation in Section 2.5.
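As an illustration, a hedged sketch combining these features (assuming state data files named like wi.dat, as before):

executable = compare_states
arguments = $(state) us.dat $Fn(state).out
transfer_input_files = us.dat, $(state)
queue state matching [:1] *.dat

Here $Fn(state) strips the extension, so wi.dat yields wi.out, and the [:1] slice limits the submission to the first matching file for a quick test.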

Your Turn!

Exercises!

Ask questions! Lots of instructors are around.

Coming up:
- Now-12:15: Hands-on exercises
- 12:15-1:15: Lunch
- 1:15-5:30: Afternoon sessions

Backup Slides

HTCondor Commands

List Jobs: condor_q

- Select jobs: by user (defaults to you), cluster, or job ID
- Format output as you like
- View full ClassAd(s), typically 80-90 attributes (most useful when limited to a single job ID)
- Ask HTCondor why a job is not running; it may not explain everything, but it can help
- Remember: negotiation happens periodically
- Explore condor_q options in the coming exercises
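A few illustrative invocations (output details vary by HTCondor version), using job 8.0 from the earlier example:

condor_q                      # list your jobs
condor_q -long 8.0            # full ClassAd for job 8.0
condor_q -better-analyze 8.0  # ask why job 8.0 is not running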

List Slots: condor_status

- Select slots: available, by host, or a specific slot
- Select slots by ClassAd expression, e.g. slots with SL6 and > 10 GB memory
- Format output as you like
- View full ClassAd(s), typically 120-250 attributes (most useful when limited to a single slot)
- Explore condor_status options in the exercises
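A sketch of the SL6 / > 10 GB example (the slot name is made up; the attribute names follow the machine ClassAd shown earlier, and Memory is in MB, so > 10240 means more than 10 GB):

condor_status                                  # list all slots
condor_status -long slot1@e001.chtc.wisc.edu   # full ClassAd for one slot
condor_status -constraint 'OpSysName == "SL" && OpSysMajorVer == 6 && Memory > 10240'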

HTCondor Universes

Different combinations of configurations and features are bundled as universes:
- vanilla: a "normal" job; the default, and fine for today
- standard: supports checkpointing and remote I/O
- java: special support for Java programs
- parallel: supports parallel jobs (such as MPI)
- grid: submits to remote systems (more tomorrow)
- ...and many others
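The universe is selected with a single submit file line, for example:

universe = vanilla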

Typical Architecture

[Diagram: one central manager (collector + negotiator) serving several submit machines, each running a schedd, and many execute machines, each running a startd. In CHTC's pool: the central manager is cm.chtc.wisc.edu, the submit machine is learn.chtc.wisc.edu, and execute machines are named eNNN.chtc.wisc.edu and cNNN.chtc.wisc.edu.]

The Life of an HTCondor Job

[Diagram: submit machine (schedd) and execute machine (startd) reporting to the central manager (negotiator + collector)]

The schedd and startd send periodic updates to the collector. Then:
1. A job is submitted to the schedd.
2. The negotiator requests job details from the schedd.
3. The schedd sends jobs to the negotiator.
4. The negotiator notifies the schedd of the match.
5. The schedd claims the startd.
6. The schedd starts a shadow; the startd starts a starter.
7. The shadow transfers the executable and input to the starter.
8. The starter starts the job; output is transferred back when it completes.

Matchmaking Algorithm (sort of)

A. Gather lists of machines and waiting jobs
B. For each user:
   1. Compute the maximum number of slots to allocate to the user; this is the user's "fair share", a percentage of the whole pool
   2. For each job (until the maximum number of matches is reached):
      A. Find all machines that are acceptable (i.e., machine and job requirements are met)
      B. If there are no acceptable machines, skip to the next job
      C. Sort the acceptable machines by the job's preferences
      D. Pick the best one
      E. Record the match of job and slot

ClassAds

- In HTCondor, information about machines and jobs (and more) is represented by ClassAds
- You do not write ClassAds (much), but reading them may help with understanding and debugging
- ClassAds can represent persistent facts, current state, preferences, requirements, ...
- HTCondor uses a core of predefined attributes, but users can add other, new attributes, which can be used for matchmaking, reporting, etc.

Sample ClassAd Attributes

MyType = "Job"                                            (string)
TargetType = "Machine"
ClusterId = 14                                            (number)
Owner = "cat"
Cmd = "/.../test-job.py"
Requirements = (Arch == "X86_64") && (OpSys == "LINUX")   (operations/expressions)
Rank = 0.0
In = "/dev/null"
UserLog = "/.../test-job.log"
Out = "test-job.out"
Err = "test-job.err"
NiceUser = false                                          (boolean)
ShoeSize = 10                                             (arbitrary, user-defined)