
1 Condor Administrator’s How-to
Dan Bradley, University of Wisconsin-Madison, Condor and DISUN Teams
dan@hep.wisc.edu
http://www.cs.wisc.edu/condor

2 www.cs.wisc.edu/condor Dan, Condor Week 2008 Where to Find the Online How-to Collection
1. Go to http://www.cs.wisc.edu/condor/
2. Click on “Condor Admin How-to Recipes”
Currently, that takes you here: http://nmi.cs.wisc.edu/node/1465

3 Brief Overview of Selected Bits

4 Question
› How does Condor decide which job gets to run on an execute machine?

5 The Life of a Condor Job
[Diagram labels: condor_submit; schedd (job queue); startd (job executor); central manager (collector + negotiator); central manager 2 and central manager 3 (collector + negotiator), reached by flocking; machine ClassAd; job ClassAd; job runs]

6 First Stop: Authorization
› The user must be authorized to submit to the schedd:
    ALLOW_WRITE = allow1, allow2, …
    DENY_WRITE = deny1, deny2, …
  Entry format: user@uid_domain/network
› By default, all authenticated users within the trusted network may submit jobs:
    ALLOW_WRITE = */network
    HOSTALLOW_WRITE = network (old style)

7 Next Stop: The Job Queue
› MAX_JOBS_RUNNING = 200
› Job priority = integer
  - orders a user’s jobs
  - higher priority will run sooner

8 Authorization of the Schedd to Join Pool
› ALLOW_ADVERTISE_SCHEDD, DENY_ADVERTISE_SCHEDD
  - Default: ALLOW/DENY_DAEMON
    - Default: ALLOW/DENY_WRITE
› COLLECTOR_REQUIREMENTS
  - Default: true

9 Next Stop: Negotiator Fair Share
› User priority
  - Inversely proportional to fair share
› Example: two users, 60 batch slots
  - priority 50: gets 40 slots
  - priority 100: gets 20 slots
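The arithmetic behind the example on this slide can be sketched in a few lines of Python. This is only an illustration of "inversely proportional", not the negotiator's actual code; the function name `fair_share` and the user names are made up for the example.

```python
def fair_share(priorities, total_slots):
    """Split total_slots in inverse proportion to each user's priority value.

    A lower priority value means a larger share, as on the slide.
    """
    weights = {user: 1.0 / prio for user, prio in priorities.items()}
    total_weight = sum(weights.values())
    return {user: total_slots * w / total_weight for user, w in weights.items()}


# Hypothetical user names; priorities and slot count are from the slide.
shares = fair_share({"user_a": 50, "user_b": 100}, total_slots=60)
for user, slots in shares.items():
    print(f"{user}: about {slots:.0f} slots")
```

With priorities 50 and 100 the weights are in a 2:1 ratio, which is where the 40/20 split of the 60 slots comes from.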

10 Fair Share Dynamics
› User priority changes over time
  - it converges toward the number of slots in use
› Example:
  - User steadily running 100 jobs: priority 100
  - Stops running jobs:
      1 day later: priority 50
      2 days later: priority 25
› Configure speed of adjustment:
    PRIORITY_HALFLIFE = 86400
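The half-life behavior in this example can be modeled as exponential decay toward the number of slots in use. The sketch below is a simplified model that reproduces the slide's numbers; it assumes usage stays constant over the interval and ignores any details of the real accountant (such as how often it updates). The function name `decayed_priority` is hypothetical.

```python
PRIORITY_HALFLIFE = 86400  # seconds (one day), as configured on the slide


def decayed_priority(priority, slots_in_use, elapsed_seconds,
                     halflife=PRIORITY_HALFLIFE):
    """Move priority toward slots_in_use, closing half the gap every halflife.

    Simplified model: assumes slots_in_use is constant over elapsed_seconds.
    """
    remaining_gap = 0.5 ** (elapsed_seconds / halflife)
    return slots_in_use + (priority - slots_in_use) * remaining_gap


# A user at priority 100 stops running jobs (0 slots in use).
print(decayed_priority(100, 0, 1 * 86400))  # one day later
print(decayed_priority(100, 0, 2 * 86400))  # two days later
```

A smaller PRIORITY_HALFLIFE makes the pool forget past usage faster; a larger one penalizes heavy users for longer.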

11 Modified Fair Share
› User Priority Factor
  - multiplies the “real user priority”
  - result is called “effective user priority”
› Example:
    condor_userprio -setfactor atlas@hep.wisc.edu 4.0
    condor_userprio -setfactor cms@hep.wisc.edu 1.0
  - atlas steadily uses 10 slots: effective priority 40
  - cms steadily uses 20 slots: effective priority 20
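As a quick check of the example, the effective priority is just the product of the real priority and the factor. The Python below is an illustrative sketch only; `effective_priority` is a made-up helper, and the real priority is taken to equal the steady slot usage, as described on the previous slide.

```python
def effective_priority(real_priority, factor):
    """Effective user priority = real user priority * priority factor."""
    return real_priority * factor


# Steady usage and factors from the slide.
atlas = effective_priority(real_priority=10, factor=4.0)
cms = effective_priority(real_priority=20, factor=1.0)
print(atlas, cms)
```

So with a factor of 4.0, atlas ends up with a worse (numerically higher) effective priority than cms even though it uses half as many slots.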

12 Reporting Condor Pool Usage
% condor_userprio -usage -allusers
Last Priority Update:  7/30 09:59
                               Accumulated      Usage            Last
User Name                      Usage (hrs)    Start Time      Usage Time
------------------------------ ----------- ---------------- ----------------
…
osg_usatlas1@hep.wisc.edu        599739.09  4/18/2006 14:37  7/30/2007 07:24
jherschleb@lmcg.wisc.edu         799300.91  4/03/2006 12:56  7/30/2007 09:59
szhou@lmcg.wisc.edu             1029384.68  4/03/2006 12:56  7/30/2007 09:59
osg_cmsprod@hep.wisc.edu        2013058.70  4/03/2006 16:54  7/30/2007 09:59
------------------------------ ----------- ---------------- ----------------
Number of users: 271            8517482.95  4/03/2006 12:56  7/29/2007 10:00
› When upgrading Condor, preserve the central manager’s AccountantLog
  - Happens automatically if you follow the general rule: preserve Condor’s LOCAL_DIR

13 Matchmaking
› Job requirements and machine requirements must both be met
› Machine requirements are configured via the START expression:
    START = Owner == "appinstaller"

14 Adding to Job Requirements
    APPEND_REQUIREMENTS = MY.Owner != "appinstaller" || TARGET.IsAppInstallerMachine =?= True

15 Adding Attribute to Machine ClassAd
    IsAppInstallerMachine = True
    STARTD_ATTRS = $(STARTD_ATTRS) IsAppInstallerMachine
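The three settings on slides 13 to 15 work together, and a small Python model may make the two-sided match easier to follow. This is a sketch under assumptions: ClassAds are modeled as plain dicts, a missing key stands in for an undefined attribute, ordinary machines are assumed to accept every job, and the user name "alice" is invented. It is not how Condor evaluates ClassAds internally.

```python
def installer_start(job):
    # START = Owner == "appinstaller"  (machine side, slide 13)
    return job.get("Owner") == "appinstaller"


def ordinary_start(job):
    # Assumption for this sketch: ordinary machines accept every job.
    return True


def job_requirements(job, machine):
    # MY.Owner != "appinstaller" || TARGET.IsAppInstallerMachine =?= True
    # (=?= is modeled with "is True", so a missing attribute is not a match)
    return (job.get("Owner") != "appinstaller"
            or machine.get("IsAppInstallerMachine") is True)


def matches(job, machine, start):
    """Both the machine's START and the job's requirements must hold."""
    return start(job) and job_requirements(job, machine)


installer_machine = {"IsAppInstallerMachine": True}  # advertised via STARTD_ATTRS
ordinary_machine = {}                                 # attribute undefined

installer_job = {"Owner": "appinstaller"}
user_job = {"Owner": "alice"}  # hypothetical user

print(matches(installer_job, installer_machine, installer_start))  # True
print(matches(installer_job, ordinary_machine, ordinary_start))    # False
print(matches(user_job, installer_machine, installer_start))       # False
print(matches(user_job, ordinary_machine, ordinary_start))         # True
```

The appended job requirement keeps installer jobs off ordinary machines, while START keeps ordinary jobs off the installer machine.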

16 Choosing Between Matching Machines
1. NEGOTIATOR_PRE_JOB_RANK
2. job rank expression
3. NEGOTIATOR_POST_JOB_RANK
4. PREEMPTION_RANK

17 Example
    NEGOTIATOR_PRE_JOB_RANK = (IsDesktop =!= True && isUndefined(RemoteOwner)) + isUndefined(RemoteOwner)
› Most desirable to least:
  - 2: unclaimed and not a desktop
  - 1: unclaimed and desktop
  - 0: claimed
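To see why this expression yields 2, 1, and 0, here is a rough Python rendering of it. Machine ClassAds are modeled as dicts and an absent key plays the role of an undefined attribute; this is an approximation for illustration, and the owner name in the examples is invented.

```python
def pre_job_rank(machine):
    """Approximate the slide's NEGOTIATOR_PRE_JOB_RANK for a dict-based machine ad."""
    unclaimed = machine.get("RemoteOwner") is None       # isUndefined(RemoteOwner)
    not_desktop = machine.get("IsDesktop") is not True   # IsDesktop =!= True
    # Booleans count as 1 or 0 when added, as in the ClassAd expression.
    return int(not_desktop and unclaimed) + int(unclaimed)


print(pre_job_rank({}))                                # unclaimed, not a desktop
print(pre_job_rank({"IsDesktop": True}))               # unclaimed desktop
print(pre_job_rank({"RemoteOwner": "someone@example"}))  # claimed (hypothetical owner)
```

Because a claimed machine scores 0 in both terms, the negotiator prefers any unclaimed machine first, which reduces the need for preemption.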

18 Authorizing Schedd to Claim Startd
› ALLOW/DENY_WRITE
› It is the schedd that is authorized by the startd, not the user.

19 Preemption

20 Machine Rank
› Numerical expression:
  - higher number preempts lower number
  - user priority is secondary to rank, because a higher-rank job preempts the claim to the machine
› Example:
  - CMS gets 1st priority, CDF gets 2nd, others 3rd
    RANK = 2*(User == "cms@hep.wisc.edu") + 1*(User == "cdf@hep.wisc.edu")
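The RANK example can be mirrored in Python to show which job would displace which. This is a sketch only: jobs are modeled as dicts, `may_preempt` is a made-up helper that captures just the "higher number preempts lower number" rule, and the third user name is invented.

```python
def machine_rank(job):
    """RANK = 2*(User == "cms@hep.wisc.edu") + 1*(User == "cdf@hep.wisc.edu")"""
    user = job.get("User")
    return 2 * (user == "cms@hep.wisc.edu") + 1 * (user == "cdf@hep.wisc.edu")


def may_preempt(candidate_job, running_job):
    """A candidate preempts by rank only if its rank is strictly higher."""
    return machine_rank(candidate_job) > machine_rank(running_job)


cms = {"User": "cms@hep.wisc.edu"}
cdf = {"User": "cdf@hep.wisc.edu"}
other = {"User": "someone@hep.wisc.edu"}  # hypothetical user

print(machine_rank(cms), machine_rank(cdf), machine_rank(other))
print(may_preempt(cms, cdf))    # CMS outranks CDF
print(may_preempt(cdf, cms))    # CDF cannot displace CMS by rank
```

Equal ranks never preempt each other under this rule, so jobs from two "other" users are left to the user-priority mechanisms.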

21 Another Rank Example
    Rank = (Group =?= "LMCG") * (1000 + RushJob)

22 Note on Scope of Condor Policies
› pool-wide scope (example: negotiator)
  - user priorities, factors, etc.
  - preemption policy related to user priority
  - steering jobs via negotiator job rank
› execute machine/slot scope: startd
  - machine rank, requirements
  - preemption/suspension policy
  - customized machine ClassAd values
› submit machine scope
  - queue policy, automatic additions to job requirements, and insertion of arbitrary ClassAd attributes into job
› personal scope
  - environmental configurations: _CONDOR_ =value

23 Preemption Policy
› Should Condor jobs yield to non-Condor activity on the machine?
› Should some types of jobs never be interrupted? After 4 days?
› Should some jobs immediately preempt others? After 30 minutes?
› Is suspension more desirable than killing?
› Can the need for preemption be decreased by steering jobs towards the right machines?

24 Example Preemption Policy
When a claim is preempted, do not allow killing of jobs that have been running for less than 4 days.
    MaxJobRetirementTime = 3600 * 24 * 4
› Applies to all forms of preemption:
  - user priority, machine rank, machine activity, graceful shutdown

25 Another Preemption Policy
› The expression can refer to attributes of the batch slot and the job, so it can be highly customized.
    MaxJobRetirementTime = 3600 * 24 * 4 * (OSG_VO =?= "uscms")
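A Python approximation of this expression shows how the boolean factor switches the retirement time on and off. It is a sketch: the job ad is modeled as a dict, `=?=` is approximated with `==` on a possibly missing key, and `is_protected` is a hypothetical helper that is not a Condor setting. The VO name "osg" in the example is also invented.

```python
FOUR_DAYS = 3600 * 24 * 4  # 345600 seconds


def max_job_retirement_time(job):
    """MaxJobRetirementTime = 3600 * 24 * 4 * (OSG_VO =?= "uscms")"""
    # A missing OSG_VO compares as not equal, so the factor is 0.
    return FOUR_DAYS * int(job.get("OSG_VO") == "uscms")


def is_protected(job, runtime_seconds):
    """True while a preempted job is still within its retirement time."""
    return runtime_seconds < max_job_retirement_time(job)


print(max_job_retirement_time({"OSG_VO": "uscms"}))  # four days, in seconds
print(max_job_retirement_time({"OSG_VO": "osg"}))    # 0: no protection
print(is_protected({"OSG_VO": "uscms"}, runtime_seconds=3600))
```

Jobs from any other VO, or with no OSG_VO attribute at all, get a retirement time of 0 and can be killed as soon as the claim is preempted.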

26 More Preemption Controls
› PREEMPTION_REQUIREMENTS
  - controls user-priority based preemption at the level of the negotiator
› PREEMPT/SUSPEND
  - controls preemption by machine activity (e.g. keyboard or CPU activity)
› RANK
  - allows preemption by more desirable jobs

27 Preemption Policy Pitfall
› If you disable all forms of preemption, you probably want to limit the lifespan of claims:
    PREEMPTION_REQUIREMENTS = False
    PREEMPT = False
    RANK = 0
    CLAIM_WORKLIFE = 3600
  Otherwise, reallocation of resources will not happen until a user runs out of matching jobs.

28 What Happens to Preempted Jobs?
› Back to idle in job queue
  - NumJobStarts >= 1
› job policy: periodic_hold, periodic_remove
› admin policy: SYSTEM_PERIODIC_HOLD, SYSTEM_PERIODIC_REMOVE

29 Back to the Negotiator: Group Accounting

30 Fair Sharing Between Groups
Useful when:
› multiple user ids belong to the same group
› a group’s share of the pool is not tied to specific machines
    # Example group settings
    GROUP_NAMES = group_physics, group_chemistry
    GROUP_QUOTA_group_physics = 200
    GROUP_QUOTA_group_chemistry = 100
    GROUP_AUTOREGROUP = True
    GROUP_PRIO_FACTOR_group_physics = 10
    GROUP_PRIO_FACTOR_group_chemistry = 10
    DEFAULT_PRIO_FACTOR = 100

31 Setting Group Identity
The job advertises its own group identity:
    +AccountingGroup = "group_physics.dan"
  (group name: group_physics; group user: dan)
Anyone can declare any identity. This is not the Unix/Windows identity the job runs as. It is solely for accounting and prioritization purposes.

32 Monitoring Usage
% condor_userprio -usage -allusers
Last Priority Update:  7/30 09:59
                                 Accumulated      Usage            Last
User Name                        Usage (hrs)    Start Time      Usage Time
-------------------------------- ----------- ---------------- ----------------
…
group_physics.atlas@hep.wisc.edu   599739.09  4/18/2006 14:37  7/30/2007 07:24
group_physics.cms@hep.wisc.edu     799300.91  4/03/2006 12:56  7/30/2007 09:59
group_chemistry.han@che.wisc.edu  1029384.68  4/03/2006 12:56  7/30/2007 09:59
group_chemistry.ben@che.wisc.edu  2013058.70  4/03/2006 16:54  7/30/2007 09:59
-------------------------------- ----------- ---------------- ----------------
Number of users: 271              8517482.95  4/03/2006 12:56  7/29/2007 10:00
% condor_userprio -all -allusers

33 How Do Groups Compete?
› The group using the smallest share of its quota gets top priority in matchmaking.
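The ordering rule on this slide can be illustrated with a short Python sketch. The quotas are taken from the group settings on slide 30; the current usage figures and the function name `negotiation_order` are invented for the example, and the real negotiator may break ties or handle edge cases differently.

```python
def negotiation_order(usage, quota):
    """Order groups by the fraction of their quota currently in use, lowest first."""
    return sorted(quota, key=lambda group: usage.get(group, 0) / quota[group])


quota = {"group_physics": 200, "group_chemistry": 100}  # from slide 30
usage = {"group_physics": 150, "group_chemistry": 50}   # hypothetical current usage

# physics is at 75% of its quota, chemistry at 50%, so chemistry goes first.
print(negotiation_order(usage, quota))
```

Note that the comparison is relative to quota: chemistry goes first here even though physics has more unused slots in absolute terms.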

34 How Do Users Within a Group Compete?
› Each group user has its own user priority
› Fair share between group members is determined by the usual user priority mechanism

35 May Group Exceed its Quota?
› Yes, but only if
    GROUP_AUTOREGROUP = True
  OR, if undefined
    GROUP_AUTOREGROUP_ = True

36 When Exceeding Quota, How Do Users Compete?
› All non-group users plus group users trying to exceed their quota compete for remaining machines.
› The user priority of the group user (e.g. “group_physics.dan”) is used to determine fair share.
  - Can set default priority factor for all members of group:
    GROUP_PRIO_FACTOR_ = 10

37 The End of the Story

38 The Life of a Condor Job
[Diagram labels: condor_submit; schedd (job queue); startd (job executor); central manager (collector + negotiator); central manager 2 and central manager 3 (collector + negotiator), reached by flocking; machine ClassAd; job ClassAd; job runs]

39 Extending the Reach
› FLOCK_TO =
  - requires bi-directional connectivity
  - in Linux, can use GCB to connect private networks
› Grid Universe: Globus, Condor-C
  - condor_glidein
  - JobRouter

40 Trivia
› What’s the difference?
    IsHighPrioUser = Owner == "dan"
    1. RANK = IsHighPrioUser
    2. RANK = $(IsHighPrioUser)
› Case 1 needs:
    STARTD_ATTRS = IsHighPrioUser

41 Where to Find the Online How-to Collection
1. Go to http://www.cs.wisc.edu/condor/
2. Click on “Condor Admin How-to Recipes”
Currently, that takes you here: http://nmi.cs.wisc.edu/node/1465

