Landing in the Right Nest: New Negotiation Features for Enterprise Environments
Jason Stowe
New Features for Negotiation
Experience in Enterprise Environments
What is an Enterprise Environment?
Any Organization Using Condor with Demanding Users
Organization = Groups of Demanding Users
Purchased Computer Capacity
Guaranteed Minimum Capacity
Need As Many as Possible
As Soon as they submit
Vanilla/Java Universe
Avoid Preemption
How do we ensure Resources land in the right Group’s Nest?
A valid definition of Enterprise Condor Users?
I started off as a Demanding User
Follow up to earlier work
Condor Week 2005
Condor for Movies: 75+ million jobs, CPUs (Linux/OSX), 70+ TB storage
(Project that added AccountingGroups)
Condor Week 2006
Web-based Management Tools, Consulting, and 24/7 Support
A Conversation with Miron
Bob Nordlund’s idea for Condor += Hooks
Configuration with Pipes (Condor 6.8): CONDOR_CONFIG = cat /opt/condor/condor_config |
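To spell out the pipe syntax (same path as on the slide): a trailing "|" in CONDOR_CONFIG tells Condor to execute the command and read its standard output as the configuration.

```
# Condor 6.8: run the command, read its stdout as the config file.
CONDOR_CONFIG = cat /opt/condor/condor_config |
```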
Demanding Condor use at banks and insurance companies => this year, new features
Negotiation Policies to Manage Number of Resources
For Groups and Users
What are the Requirements?
- Guaranteed Minimum Quota
- Fast Claiming of Quota
- Avoid Unnecessary Preemption
Three Common Ways
“Fair Share”: User Priority and PREEMPTION_REQUIREMENTS
Machine RANK
AccountingGroups GROUP_QUOTA
Generally these are a progression
Story of a Pool
Fair-Share, User Priority
It Works! More Users…
condor_userprio -setfactor A 2
condor_userprio -setfactor B 2
PREEMPTION_REQUIREMENTS = RemoteUserPrio > SubmittorPrio
Works Well in Most cases
Suppose A has all 100 machines, and B submits 100 jobs
User priorities are cached at the beginning of the negotiation cycle, and not updated during it…
PREEMPTION_REQUIREMENTS = RemoteUserPrio > SubmittorPrio
Standard Universe = No Problem (Preemption doesn’t lose work)
Problem: Vanilla or Java Universe (Work is lost!)
Dampen these with:
NEGOTIATOR_MAX_TIME_PER_SUBMITTER
NEGOTIATOR_MAX_TIME_PER_PIESPIN
Slows matching rate, can lead to starvation
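As a sketch, the dampening knobs above might be set like this (the time limits, in seconds, are illustrative values, not recommendations):

```
# Cap how long the negotiator spends on any one submitter per cycle.
NEGOTIATOR_MAX_TIME_PER_SUBMITTER = 60
# Cap time spent per "pie spin" (one pass dividing up the pool).
NEGOTIATOR_MAX_TIME_PER_PIESPIN = 120
```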
Time For RANK
RANK = Owner =?= “A” on 50 machines
RANK = Owner =?= “B” on the other 50 machines
Users get their “quota”
Tied to particular machines
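A sketch of the machine-side configuration for the RANK scheme above (owner names as in the example):

```
# Local config on the 50 machines dedicated to user A:
RANK = (Owner =?= "A")

# Local config on the other 50 machines, dedicated to user B:
# RANK = (Owner =?= "B")
```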
Problem: Group A submits 100 jobs on Empty Pool
50 jobs Finish
Group B submits 100 jobs; empty machines get jobs
A jobs on B machines are preempted
B jobs on A machines are preempted
Skip Preemption, Use Empty Machines?
Accounting Groups, GROUP_QUOTA
Number of new machines = 200
GROUP_QUOTA_A = 50
GROUP_QUOTA_B = 50
GROUP_QUOTA_C = 50
GROUP_QUOTA_D = 50
GROUP_AUTOREGROUP = True
If A and B hold 100 machines each (via auto-regroup), how does C get resources?
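For usage to count against a group's quota, each job must declare its group in the submit description file; a sketch, with a hypothetical member "alice" of group C:

```
# Submit description file fragment (group/user names illustrative)
+AccountingGroup = "C.alice"
executable = my_job
queue
```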
PREEMPTION_REQUIREMENTS Still has cache/preemption issues
We need access to up-to-date usage/quota information inside PREEMPTION_REQUIREMENTS
A Conversation with Todd
SubmitterUserPrio, SubmitterUserResourcesInUse (and RemoteUser equivalents)
SubmitterGroupQuota, SubmitterGroupResourcesInUse (and RemoteGroup equivalents)
With Great Power Comes Great Responsibility
IMPORTANT: turn off caching (may slow negotiation down)
PREEMPTION_REQUIREMENTS_STABLE = False
PREEMPTION_RANK_STABLE = False
PREEMPTION_REQUIREMENTS = (SubmitterGroupResourcesInUse < SubmitterGroupQuota) && (RemoteGroupResourcesInUse > RemoteGroupQuota)
PREEMPTION_REQUIREMENTS_STABLE = False
RANK = 0
Now we have everything needed!
Demanding Groups of Users
Getting Purchased Compute Capacity (Quota, not tied to machine)
Getting Guaranteed Minimum Capacity (GROUP_QUOTA)
Getting As Many as Possible (Auto-Regroup)
Getting them as soon as they submit (typically within one negotiation cycle)
Avoids Preemption
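Putting the pieces together, a configuration sketch of the policy described in this talk (quota values are the ones from the example; an illustration, not a drop-in recipe):

```
# Four groups, each guaranteed 50 of the pool's 200 machines.
GROUP_QUOTA_A = 50
GROUP_QUOTA_B = 50
GROUP_QUOTA_C = 50
GROUP_QUOTA_D = 50
# Let groups claim idle machines beyond their quota.
GROUP_AUTOREGROUP = True

# Use up-to-date (uncached) usage/quota values; may slow negotiation.
PREEMPTION_REQUIREMENTS_STABLE = False
PREEMPTION_RANK_STABLE = False

# Preempt only when the submitting group is under its quota and the
# running job's group is over its quota.
PREEMPTION_REQUIREMENTS = (SubmitterGroupResourcesInUse < SubmitterGroupQuota) \
                          && (RemoteGroupResourcesInUse > RemoteGroupQuota)

# No machine-side owner preference.
RANK = 0
```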
condor_status?
It Works! (patched 6.8 and 6.9+)
Code & the Condor Community Process
Where do we go from here? What did we learn?
Wisconsin is working on making 6.9 negotiation/scheduling more efficient
In the future: allow us to specify what we account for per VM/slot (e.g., KFLOPS)?
That’s just me…
Come to tonight’s Reception Participate in the Community
Talk with Condor Team. Talk with other users.
Help the community continue to work well for everyone.
Thank you. Questions? cyclecomputing.com