Accounting in HTCondor Greg Thain INFN Workshop 2016
HTCondor Architecture
Overview of Condor Architecture [Diagram: Schedd A (Greg Job1-3, Ann Job1-3) and Schedd B (Greg Job4-6, Ann Job7-8, Joe Job1-3), the Central Manager with its Usage History, and the worker nodes.]
2 Steps to Scheduling
Step 1: The CM assigns SLOTS to USERS based on historical fair share. It keeps track of usage persistently and informs the schedds of the slots for their users. This is called the negotiation cycle. [Diagram: Schedd A, Schedd B, Central Manager with Usage History, workers.]
Step 2: Schedds assign SLOTS to JOBS based on job prio and fit. A schedd will REUSE a slot for more than one job. A pool can have many schedds. [Diagram: Schedd A, Schedd B, Central Manager with Usage History, workers.]
Consequences: All accounting happens in the CM, both monitoring and control. Accounting data is persistent. Accounting is rolled up, not a log. The CM knows NOTHING about jobs!
What’s a user? Is Bob in schedd1 the same as Bob in schedd2? If they have the same UID_DOMAIN, they are. This prevents cheating by adding schedds. Map files can define the local user name.
condor_userprio: tool to view/change user prio. Command usage to view current priorities: condor_userprio -most

User Name                        Effective Priority  Priority Factor  In Use  (wghted-hrs)  Last Usage
-------------------------------  ------------------  ---------------  ------  ------------  ----------
lmichael@submit-3.chtc.wisc.edu                5.00            10.00       0         16.37     0+23:46
blin@osghost.chtc.wisc.edu                     7.71            10.00       0       5412.38     0+01:05
osgtest@osghost.chtc.wisc.edu                 90.57            10.00      47      45505.99       <now>
cxiong36@submit-3.chtc.wisc.edu              500.00          1000.00       0          0.29     0+00:09
ojalvo@hep.wisc.edu                          500.00          1000.00       0     398148.56     0+05:37
wjiang4@submit-3.chtc.wisc.edu               500.00          1000.00       0          0.22     0+21:25
cxiong36@submit.chtc.wisc.edu                500.00          1000.00       0         63.38     0+21:42
Metric: Effective Priority. The negotiator computes and stores the user prio. It is inversely related to machines allocated (a lower number is a better priority). A user with a priority of 10 will be able to claim twice as many machines as a user with a priority of 20.
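The 10-versus-20 example can be checked with a small sketch. It assumes, as a simplification, that each user's share of the pool is proportional to the inverse of the effective priority. The function name `fair_shares`, the user names, and the 30-slot pool are hypothetical; the real negotiator also has to respect job requirements and what each user actually has queued.

```python
def fair_shares(priorities, total_slots):
    """Split total_slots among users in proportion to 1 / effective priority.

    Simplified model (an assumption): ignores job requirements, groups,
    slot weights, and how many jobs each user has idle.
    """
    inverse = {user: 1.0 / prio for user, prio in priorities.items()}
    norm = sum(inverse.values())
    return {user: total_slots * inv / norm for user, inv in inverse.items()}


# Hypothetical pool: ann has priority 10, greg has priority 20.
shares = fair_shares({"ann": 10.0, "greg": 20.0}, total_slots=30)
print(shares)  # ann gets roughly twice as many slots as greg
```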
Effective User Priority: The (effective) user priority is determined by multiplying two components: Real Priority * Priority Factor.
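As a quick worked example, the product can be checked against the condor_userprio output shown earlier: users with no recent usage sit at the starting real priority of 0.5, which gives 500.00 with a factor of 1000.00 and 5.00 with a factor of 10.00. The function name `effective_priority` is only illustrative.

```python
def effective_priority(real_priority, priority_factor):
    """Effective priority = real priority * priority factor."""
    return real_priority * priority_factor


print(effective_priority(0.5, 1000.0))  # 500.0
print(effective_priority(0.5, 10.0))    # 5.0
```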
Real Priority: Based on actual usage, starts at 0.5. Approaches the actual number of machines used over time. Configuration setting PRIORITY_HALFLIFE: default is one day (in seconds); if PRIORITY_HALFLIFE = +Inf, no history. Asymptotically grows/shrinks to current usage.
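One way to picture the half-life behaviour is an exponential moving average. The sketch below assumes the update rule prio = beta * prio + (1 - beta) * usage, with beta = 0.5 ** (dt / PRIORITY_HALFLIFE) and a floor of 0.5. The function name, the 300-second update interval, and the 10-machine workload are illustrative assumptions, not values from the talk.

```python
PRIORITY_HALFLIFE = 86400.0  # one day, in seconds
MIN_PRIORITY = 0.5           # starting value of the real priority


def update_real_priority(prio, machines_in_use, dt, halflife=PRIORITY_HALFLIFE):
    """Advance the real priority by dt seconds toward the current usage.

    Assumed model: exponential decay of the gap between the priority
    and the number of machines currently in use.
    """
    beta = 0.5 ** (dt / halflife)
    new_prio = beta * prio + (1.0 - beta) * machines_in_use
    return max(new_prio, MIN_PRIORITY)


# A new user holds 10 machines for one half-life (288 updates of 300 s).
prio = MIN_PRIORITY
for _ in range(288):
    prio = update_real_priority(prio, machines_in_use=10, dt=300.0)
print(round(prio, 2))  # about halfway between 0.5 and 10
```

After one half-life the priority has closed half of the gap to the current usage, and it keeps approaching 10 asymptotically. When the user stops running jobs, the same rule shrinks the value back toward the floor.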
Priority Factor: Assigned by the administrator. Set/viewed with condor_userprio. Persistently stored in the CM. Defaults to 1000 (DEFAULT_PRIO_FACTOR). Allows admins to give prio to sets of users, while still having fair share within a group. “Nice user”s have Prio Factors of 1,000,000.
Condor principle #2: Condor provides access to operational info. You need to make it pretty…
condor_userprio -l: for machine-parseable formats. Parse it yourself, and create graphs and reports.
condor_userprio -l

Name = "gthain@chevre.cs.wisc.edu"
ResourcesUsed = 11
WeightedResourcesUsed = 11.0
LastHeardFrom = 1477379405
LastUpdate = 1477379405
Priority = 500.0799865722656
WeightedAccumulatedUsage = 3878596.0
UpdateSequenceNumber = 0
PriorityFactor = 1000.0
MyType = "Accounting"
IsAccountingGroup = false
AccumulatedUsage = 3861736.0
BeginUsageTime = 1469125737
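A minimal sketch of "parse it yourself", assuming the output arrives as one `Attribute = value` pair per line. The helper name `parse_classad` is hypothetical, the sample is a subset of the attributes shown above, and treating AccumulatedUsage as seconds is an assumption.

```python
def parse_classad(text):
    """Parse 'Attr = value' lines into a dict of str, bool, int, or float."""
    ad = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank lines and separators
        key, _, raw = line.partition("=")
        raw = raw.strip()
        if raw.startswith('"') and raw.endswith('"'):
            value = raw[1:-1]
        elif raw.lower() in ("true", "false"):
            value = raw.lower() == "true"
        else:
            try:
                value = int(raw)
            except ValueError:
                try:
                    value = float(raw)
                except ValueError:
                    value = raw  # leave unrecognized expressions as text
        ad[key.strip()] = value
    return ad


SAMPLE = '''Name = "gthain@chevre.cs.wisc.edu"
ResourcesUsed = 11
Priority = 500.0799865722656
PriorityFactor = 1000.0
IsAccountingGroup = false
AccumulatedUsage = 3861736.0
'''

ad = parse_classad(SAMPLE)
real_priority = ad["Priority"] / ad["PriorityFactor"]
usage_hours = ad["AccumulatedUsage"] / 3600.0  # assumes usage is in seconds
print(ad["Name"], round(real_priority, 4), round(usage_hours, 1))
```

Dividing Priority by PriorityFactor recovers the real priority, which is the number to plot if the goal is to compare actual usage across users with different factors.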
Overview of Condor Architecture: Schedds also keep logs (rolled, not snapshots). They have per-job data. Gratia uses these. [Diagram: Schedd A, Schedd B, Central Manager with Usage History, workers.]
condor_history

ID        OWNER   SUBMITTED    RUN_TIME    ST  COMPLETED    CMD
10973.8   gthain  10/25 02:09  0+00:00:51  C   10/25 02:10  /scratch/gthai
10973.7   gthain  10/25 02:09  0+00:00:50  C   10/25 02:10  /scratch/gthai
10973.9   gthain  10/25 02:09  0+00:00:50  C   10/25 02:10  /scratch/gthai
10973.6   gthain  10/25 02:09  0+00:00:49  C   10/25 02:10  /scratch/gthai
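Because the schedd history is per-job, per-user totals have to be rolled up by the reader. The sketch below sums RUN_TIME per owner from rows like the ones above, assuming whitespace-separated columns and the `days+HH:MM:SS` run-time format. The function names are hypothetical, and parsing the human-readable table is fragile, so this is only an illustration.

```python
from collections import defaultdict


def run_time_seconds(field):
    """Convert a 'D+HH:MM:SS' run time into seconds."""
    days, _, clock = field.partition("+")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return int(days) * 86400 + hours * 3600 + minutes * 60 + seconds


def total_run_time(lines):
    """Sum run time per owner; the header and malformed rows are skipped."""
    totals = defaultdict(int)
    for line in lines:
        fields = line.split()
        # Fields: ID, OWNER, submit date, submit time, RUN_TIME, ST, ...
        if len(fields) < 5 or "+" not in fields[4]:
            continue
        totals[fields[1]] += run_time_seconds(fields[4])
    return dict(totals)


ROWS = [
    "ID OWNER SUBMITTED RUN_TIME ST COMPLETED CMD",
    "10973.8 gthain 10/25 02:09 0+00:00:51 C 10/25 02:10 /scratch/gthai",
    "10973.7 gthain 10/25 02:09 0+00:00:50 C 10/25 02:10 /scratch/gthai",
    "10973.9 gthain 10/25 02:09 0+00:00:50 C 10/25 02:10 /scratch/gthai",
    "10973.6 gthain 10/25 02:09 0+00:00:49 C 10/25 02:10 /scratch/gthai",
]
print(total_run_time(ROWS))  # {'gthain': 200}
```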
Worker nodes also have history: condor_history -f startd_history
Summary: HTCondor uses two steps for scheduling. All accounting is handled in the central manager. Monitoring data is also available in the schedd and startd.