Dan Bradley University of Wisconsin-Madison Condor and DISUN Teams Condor Administrator’s How-to.



Dan, Condor Week 2008
Where to Find the Online How-to Collection
1. Go to
2. Click on “Condor Admin How-to Recipes”
Currently, that takes you here:

Brief Overview of Selected Bits

Question
› How does Condor decide which job gets to run on an execute machine?

The Life of a Condor Job
[Diagram: condor_submit, schedd (job queue), startd (job executor), central manager (collector + negotiator), central managers 2 and 3 (flocking), machine ClassAd, job ClassAd, job runs]

First Stop: Authorization
› User must be authorized to submit to schedd
    ALLOW_WRITE = allow1, allow2, …
    DENY_WRITE = deny1, deny2, …
› By default, all authenticated users may submit jobs within trusted network
    ALLOW_WRITE = */network
    HOSTALLOW_WRITE = network (old style)

Next Stop: The Job Queue
› MAX_JOBS_RUNNING = 200
› Job priority = integer
  - orders a user’s jobs
  - higher priority will run sooner

Authorization of the Schedd to Join Pool
› ALLOW_ADVERTISE_SCHEDD, DENY_ADVERTISE_SCHEDD
  - Default: ALLOW/DENY_DAEMON
  - Default of that: ALLOW/DENY_WRITE
› COLLECTOR_REQUIREMENTS
  - Default: true

Next Stop: Negotiator
Fair Share
› User priority
  - inversely proportional to fair share
› Example: two users, 60 batch slots
  - priority 50 gets 40 slots
  - priority 100 gets 20 slots
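The arithmetic behind the two-user example can be sketched as below. This is a simplified model, not the negotiator's actual code: it only assumes that each user's share is proportional to the inverse of the priority value, and the names `fair_share_slots`, `user_a`, and `user_b` are hypothetical.

```python
def fair_share_slots(priorities, total_slots):
    """Split total_slots in inverse proportion to each user's priority value.

    Simplified model of the slide's rule: a lower priority value earns
    a larger share of the pool.
    """
    weights = {user: 1.0 / prio for user, prio in priorities.items()}
    total_weight = sum(weights.values())
    return {user: total_slots * w / total_weight for user, w in weights.items()}


# Two users with priorities 50 and 100 sharing 60 batch slots.
shares = fair_share_slots({"user_a": 50, "user_b": 100}, total_slots=60)
for user, slots in shares.items():
    print(user, round(slots))
```

Under this model the user with priority 50 ends up with twice as many slots as the user with priority 100, which matches the 40/20 split on the slide.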

Fair Share Dynamics
› User priority changes over time
  - wants to be equal to number of slots in use
› Example:
  - User steadily running 100 jobs: priority 100
  - Stops running jobs:
    1 day later: priority 50
    2 days later: priority 25
› Configure speed of adjustment:
    PRIORITY_HALFLIFE = 86400
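The decay in the example follows from treating PRIORITY_HALFLIFE as an exponential half-life. The sketch below covers only the idle case (the user has stopped running jobs) and ignores any lower bound that Condor applies to priorities; `decayed_priority` is a hypothetical helper, not a Condor function.

```python
PRIORITY_HALFLIFE = 86400  # seconds (one day), the value shown on the slide


def decayed_priority(start_priority, idle_seconds, halflife=PRIORITY_HALFLIFE):
    """Priority of a user who stopped running jobs idle_seconds ago.

    Assumes pure exponential decay toward zero with the given half-life.
    """
    return start_priority * 0.5 ** (idle_seconds / halflife)


for days in (0, 1, 2):
    print(days, "day(s) idle:", decayed_priority(100, days * 86400))
```

Halving PRIORITY_HALFLIFE would make the same user reach priority 25 after one day instead of two.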

Modified Fair Share
› User Priority Factor
  - multiplies the “real user priority”
  - result is called “effective user priority”
› Example:
    condor_userprio -setfactor 4.0
    condor_userprio -setfactor 1.0
  - atlas steadily uses 10 slots - effective priority 40
  - cms steadily uses 20 slots - effective priority 20
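The effective priorities in the example are the product of the real priority and the factor. The sketch assumes the real priority has settled at the number of slots in steady use, as described for fair share dynamics; the function name is hypothetical.

```python
def effective_priority(real_priority, factor):
    """Effective user priority = real user priority times the priority factor."""
    return real_priority * factor


# (slots in steady use, priority factor) as in the slide's example.
users = {"atlas": (10, 4.0), "cms": (20, 1.0)}
for name, (slots_in_use, factor) in users.items():
    print(name, effective_priority(slots_in_use, factor))
```

Because a lower value is better, cms still ranks ahead of atlas here even though it is using twice as many slots.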

Reporting Condor Pool Usage
% condor_userprio -usage -allusers
[Sample output: Last Priority Update: 7/30 09:59; columns: User Name, Accumulated Usage (hrs), Usage Start Time, Last Usage Time; final line: Number of users]
› When upgrading Condor, preserve the central manager’s AccountantLog
  - Happens automatically if you follow general rule: preserve Condor’s LOCAL_DIR

Matchmaking
› Job requirements and machine requirements must both be met
› Machine requirements are configured via the START expression
    START = Owner == "appinstaller"

Adding to Job Requirements
    APPEND_REQUIREMENTS = MY.Owner != "appinstaller" || TARGET.IsAppInstallerMachine =?= True

Adding Attribute to Machine ClassAd
    IsAppInstallerMachine = True
    STARTD_ATTRS = $(STARTD_ATTRS) IsAppInstallerMachine
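Taken together, the START expression, the appended job requirement, and the advertised machine attribute pin "appinstaller" jobs to dedicated machines and keep other jobs off them. The sketch below models ClassAds as plain Python dicts and uses two-valued logic, so it glosses over ClassAd UNDEFINED semantics; it also assumes that `Owner` in START resolves to the job's Owner attribute. All function names are hypothetical.

```python
def machine_start(job, machine):
    """START: dedicated machines accept only the appinstaller's jobs."""
    if machine.get("IsAppInstallerMachine") is True:
        return job.get("Owner") == "appinstaller"   # START = Owner == "appinstaller"
    return True  # ordinary machines: assume START = True


def job_requirements(job, machine):
    """MY.Owner != "appinstaller" || TARGET.IsAppInstallerMachine =?= True"""
    return (job.get("Owner") != "appinstaller"
            or machine.get("IsAppInstallerMachine") is True)


def is_match(job, machine):
    """Both job requirements and machine requirements must be met."""
    return job_requirements(job, machine) and machine_start(job, machine)


dedicated = {"IsAppInstallerMachine": True}
ordinary = {}
for owner in ("appinstaller", "dan"):
    for label, machine in (("dedicated", dedicated), ("ordinary", ordinary)):
        print(owner, label, is_match({"Owner": owner}, machine))
```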

Choosing Between Matching Machines
1. NEGOTIATOR_PRE_JOB_RANK
2. job rank expression
3. NEGOTIATOR_POST_JOB_RANK
4. PREEMPTION_RANK

Example
    NEGOTIATOR_PRE_JOB_RANK = (IsDesktop =!= True && isUndefined(RemoteOwner)) + isUndefined(RemoteOwner)
› Most desirable to least:
  - 2: unclaimed and not a desktop
  - 1: unclaimed and desktop
  - 0: claimed
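To see how the expression produces the values 2, 1, and 0, it can be evaluated by hand over a few machine ads. The sketch models a machine ClassAd as a dict in which a missing key stands for an undefined attribute, and treats each boolean as 1 or 0 when added; `pre_job_rank` is a hypothetical stand-in for the ClassAd evaluation.

```python
def pre_job_rank(machine):
    """(IsDesktop =!= True && isUndefined(RemoteOwner)) + isUndefined(RemoteOwner)"""
    unclaimed = "RemoteOwner" not in machine               # isUndefined(RemoteOwner)
    not_desktop = machine.get("IsDesktop") is not True     # IsDesktop =!= True
    return int(not_desktop and unclaimed) + int(unclaimed)


machines = {
    "unclaimed server": {},
    "unclaimed desktop": {"IsDesktop": True},
    "claimed server": {"RemoteOwner": "some_user"},
}
for label, ad in sorted(machines.items(), key=lambda kv: -pre_job_rank(kv[1])):
    print(pre_job_rank(ad), label)
```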

Authorizing Schedd to Claim Startd
› ALLOW/DENY_WRITE
› It is the schedd that is authorized by the startd, not the user.

Preemption

Machine Rank
› Numerical expression:
  - higher number preempts lower number
  - user priority is secondary to rank, because higher rank job preempts claim to machine
› Example:
  - CMS gets 1st prio, CDF gets 2nd, others 3rd
    RANK = 2*(User == + 1*(User ==

Another Rank Example
    Rank = (Group =?= "LMCG") * ( RushJob)

Note on Scope of Condor Policies
› pool-wide scope: example negotiator
  - user priorities, factors, etc.
  - preemption policy related to user priority
  - steering jobs via negotiator job rank
› execute machine/slot scope: startd
  - machine rank, requirements
  - preemption/suspension policy
  - customized machine ClassAd values
› submit machine scope
  - queue policy, automatic additions to job requirements, and insertion of arbitrary ClassAd attributes into job
› personal scope
  - environmental configurations: _CONDOR_ =value

Preemption Policy
› Should Condor jobs yield to non-Condor activity on the machine?
› Should some types of jobs never be interrupted? After 4 days?
› Should some jobs immediately preempt others? After 30 minutes?
› Is suspension more desirable than killing?
› Can need for preemption be decreased by steering jobs towards the right machines?

Example Preemption Policy
When a claim is preempted, do not allow killing of jobs that have been running for less than 4 days.
    MaxJobRetirementTime = 3600 * 24 * 4
› Applies to all forms of preemption:
  - user priority, machine rank, machine activity, graceful shutdown

Another Preemption Policy
› Expression can refer to attributes of batch slot and job, so can be highly customized.
    MaxJobRetirementTime = 3600 * 24 * 4 * (OSG_VO =?= "uscms")
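The customized expression works because the comparison evaluates to 1 or 0 and so switches the four-day retirement window on or off per job. A minimal sketch, assuming OSG_VO is an attribute of the job ad and modeling the ad as a dict (a missing key stands for an undefined attribute); the function name is hypothetical.

```python
FOUR_DAYS = 3600 * 24 * 4  # seconds


def max_job_retirement_time(job):
    """3600 * 24 * 4 * (OSG_VO =?= "uscms")"""
    is_uscms = job.get("OSG_VO") == "uscms"   # =?= is False when OSG_VO is undefined
    return FOUR_DAYS * int(is_uscms)


print(max_job_retirement_time({"OSG_VO": "uscms"}))   # uscms jobs get the full window
print(max_job_retirement_time({"OSG_VO": "other"}))   # other VOs get none
print(max_job_retirement_time({}))                    # jobs without OSG_VO get none
```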

More Preemption Controls
› PREEMPTION_REQUIREMENTS
  - controls user-priority based preemption at the level of the negotiator
› PREEMPT/SUSPEND
  - controls preemption by machine activity (e.g. keyboard or cpu activity)
› RANK
  - allows preemption by more desirable jobs

Preemption Policy Pitfall
› If you disable all forms of preemption, you probably want to limit lifespan of claims:
    PREEMPTION_REQUIREMENTS = False
    PREEMPT = False
    RANK = 0
    CLAIM_WORKLIFE = 3600
Otherwise, reallocation of resources will not happen until a user runs out of matching jobs.

What Happens to Preempted Jobs?
› Back to idle in job queue
  - NumJobStarts >= 1
› job policy: periodic_hold, periodic_remove
› admin policy: SYSTEM_PERIODIC_HOLD, SYSTEM_PERIODIC_REMOVE

Back to the Negotiator: Group Accounting

Fair Sharing Between Groups
Useful when:
› multiple user ids belong to same group
› group’s share of pool is not tied to specific machines
    # Example group settings
    GROUP_NAMES = group_physics, group_chemistry
    GROUP_QUOTA_group_physics = 200
    GROUP_QUOTA_group_chemistry = 100
    GROUP_AUTOREGROUP = True
    GROUP_PRIO_FACTOR_group_physics = 10
    GROUP_PRIO_FACTOR_group_chemistry = 10
    DEFAULT_PRIO_FACTOR = 100

Setting Group Identity
The job advertises its own group identity:
    +AccountingGroup = "group_physics.dan"
Here group_physics is the group name and dan is the group user.
Anyone can declare any identity. This is not the unix/windows identity the job runs as. It is solely for accounting and prioritization purposes.

Monitoring Usage
% condor_userprio -usage -allusers
[Sample output: Last Priority Update: 7/30 09:59; columns: User Name, Accumulated Usage (hrs), Usage Start Time, Last Usage Time; final line: Number of users]
% condor_userprio -all -allusers

How do groups compete?
› Group using least share of its quota gets top priority in matchmaking.
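The ordering rule can be illustrated by sorting groups on the fraction of their quota currently in use. This is a simplified sketch of the stated rule, not the negotiator's implementation; the quotas are taken from the example group settings, while the usage figures and the name `negotiation_order` are hypothetical.

```python
def negotiation_order(quotas, usage):
    """Return group names ordered by fraction of quota in use, smallest first."""
    return sorted(quotas, key=lambda group: usage.get(group, 0) / quotas[group])


quotas = {"group_physics": 200, "group_chemistry": 100}
usage = {"group_physics": 150, "group_chemistry": 50}   # hypothetical slots in use

# physics is using 75% of its quota, chemistry 50%, so chemistry goes first.
print(negotiation_order(quotas, usage))
```

Note that physics uses more slots in absolute terms and still negotiates second, because only the fraction of quota matters in this model.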

How do users within a group compete?
› Each group user has its own user priority
› Fair share between group members is determined by the usual user priority mechanism

May Group Exceed its Quota?
› Yes, but only if
    GROUP_AUTOREGROUP = True
  OR, if undefined
    GROUP_AUTOREGROUP_ = True

When Exceeding Quota, How do Users Compete?
› All non-group users plus group users trying to exceed their quota compete for remaining machines.
› The user priority of the group user (e.g. “group_physics.dan”) is used to determine fair share.
  - Can set default priority factor for all members of group:
    GROUP_PRIO_FACTOR_ = 10

The End of the Story

The Life of a Condor Job
[Diagram: condor_submit, schedd (job queue), startd (job executor), central manager (collector + negotiator), central managers 2 and 3 (flocking), machine ClassAd, job ClassAd, job runs]

Extending the Reach
› FLOCK_TO =
  - requires bi-directional connectivity
  - in Linux, can use GCB to connect private networks
› Grid Universe: Globus, Condor-C
  - condor_glidein
  - JobRouter

Trivia
› What’s the difference?
    IsHighPrioUser = Owner == "dan"
    1. RANK = IsHighPrioUser
    2. RANK = $(IsHighPrioUser)
› Case 1 needs: STARTD_ATTRS = IsHighPrioUser
