Landing in the Right Nest: New Negotiation Features for Enterprise Environments Jason Stowe.

Landing in the Right Nest: New Negotiation Features for Enterprise Environments Jason Stowe

New Features for Negotiation

Experience in Enterprise Environments

What is an Enterprise Environment?

Any Organization Using Condor with

Demanding Users

Organization = Groups of Demanding Users

Purchased Compute Capacity

Guaranteed Minimum Capacity

Need As Many as Possible

As Soon as they submit

Vanilla/Java Universe

Avoid Preemption

How do we ensure Resources land in the right Group’s Nest?

A valid definition of Enterprise Condor Users?

I started off as a Demanding User

Follow-up to earlier work

Condor Week 2005

Condor for Movies: 75+ Million Jobs CPUs (Linux/OSX) 70+ TB storage

(Project that added AccountingGroups)

Condor Week 2006

Web-based Management Tools, Consulting, and 24/7 Support

A Conversation with Miron

Bob Nordlund’s idea for Condor += Hooks

Configuration with Pipes (Condor 6.8): CONDOR_CONFIG = cat /opt/condor/condor_config |
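
(Not spelled out on the slide, but presumably: when the configuration source ends in a pipe character, Condor runs it as a command and reads the configuration from its output. A minimal sketch using the slide's own example, set as an environment variable:)

  # Trailing "|" => run the command and read its stdout as the configuration
  export CONDOR_CONFIG='cat /opt/condor/condor_config |'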

Demanding Condor use cases for Banks/Insurance Companies => This year, new features

Negotiation Policies to Manage Number of Resources

For Groups and Users

What are the Requirements?

- Guaranteed Minimum Quota
- Fast Claiming of Quota
- Avoid Unnecessary Preemption

Three Common Ways

“Fair share”: User Priority + PREEMPTION_REQUIREMENTS

Machine RANK

AccountingGroups + GROUP_QUOTA

Generally these are a progression

Story of a Pool

Fair-Share, User Priority

It Works! More Users…

condor_userprio -setfactor A 2
condor_userprio -setfactor B 2

PREEMPTION_REQUIREMENTS = RemoteUserPrio > SubmittorPrio
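
(For context, not on the slide: lower numerical user priority is better in Condor, so this expression allows preemption whenever the running job's owner has a worse priority value than the idle submitter. A common softening, sketched here with an illustrative 1.2 ratio, is to require a meaningful gap before preempting:)

  PREEMPTION_REQUIREMENTS = RemoteUserPrio > SubmittorPrio * 1.2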

Works Well in Most cases

Suppose A has all 100 machines, and B submits 100 jobs

User Priorities are Cached at the Beginning of Negotiation and not Updated…

PREEMPTION_REQUIREMENTS = RemoteUserPrio > SubmittorPrio

Standard Universe = No Problem (Preemption doesn’t lose work)

Problem: Vanilla or Java Universe (Work is lost!)

Dampen these with NEGOTIATOR_MAX_TIME_PER_SUBMITTER and NEGOTIATOR_MAX_TIME_PER_PIESPIN
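
(A sketch of what that dampening might look like; both knobs are in seconds, and the values below are assumptions rather than recommendations:)

  NEGOTIATOR_MAX_TIME_PER_SUBMITTER = 60
  NEGOTIATOR_MAX_TIME_PER_PIESPIN = 120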

Slows matching rate, can lead to starvation

Time For RANK

RANK = Owner =?= "A"  (on 50 Machines)
RANK = Owner =?= "B"  (on 50 Machines)

Users get their “quota”

Tied to particular machines

Problem: Group A submits 100 jobs to an Empty Pool

50 jobs Finish

Group B submits 100 jobs: Empty Machines get jobs, and A Jobs on B Machines are preempted

A's preempted jobs are then rematched to their preferred machines, so B Jobs on A Machines are preempted.

Skip Preemption, Use Empty Machines?

Accounting Groups, GROUP_QUOTA

# New Machines = 200
GROUP_QUOTA_A = 50
GROUP_QUOTA_B = 50
GROUP_QUOTA_C = 50
GROUP_QUOTA_D = 50
GROUP_AUTOREGROUP = True
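
(Not on the slide, but for context: group names are declared to the negotiator via GROUP_NAMES, and a job opts into a group from its submit file. The group and user names below are illustrative:)

  # In the user's submit file (hypothetical group "A", user "some_user"):
  +AccountingGroup = "A.some_user"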

A and B have 100 machines each; how does C get resources?

PREEMPTION_REQUIREMENTS Still has cache/preemption issues

We need access to up-to-date Usage/Quota information in PREEMPTION_REQUIREMENTS

A Conversation with Todd

SubmitterUserPrio, SubmitterUserResourcesInUse (RemoteUser equivalents as well)

SubmitterGroupQuota, SubmitterGroupResourcesInUse (RemoteGroup equivalents as well)

With Great Power Comes Great Responsibility

IMPORTANT: Turn off caching (may slow negotiation down)
PREEMPTION_REQUIREMENTS_STABLE = False
PREEMPTION_RANK_STABLE = False

PREEMPTION_REQUIREMENTS = (SubmitterGroupResourcesInUse < SubmitterGroupQuota) && (RemoteGroupResourcesInUse > RemoteGroupQuota)
PREEMPTION_REQUIREMENTS_STABLE = False
RANK = 0
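
(Pulling the slides together, a consolidated sketch of the whole policy; the quota values come from the example above and the preemption expression is reconstructed from this slide, so treat it as illustrative rather than a reference configuration:)

  # Central manager / negotiator
  GROUP_NAMES = A, B, C, D
  GROUP_QUOTA_A = 50
  GROUP_QUOTA_B = 50
  GROUP_QUOTA_C = 50
  GROUP_QUOTA_D = 50
  GROUP_AUTOREGROUP = True

  # Re-evaluate per match so usage/quota attributes stay current
  PREEMPTION_REQUIREMENTS_STABLE = False
  PREEMPTION_RANK_STABLE = False

  # Preempt only when the idle job's group is under its quota
  # and the running job's group is over its quota
  PREEMPTION_REQUIREMENTS = (SubmitterGroupResourcesInUse < SubmitterGroupQuota) && (RemoteGroupResourcesInUse > RemoteGroupQuota)

  # Execute machines: no owner preference, so RANK-based preemption never fires
  RANK = 0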

Now we have everything needed!

Demanding Groups of Users

Getting Purchased Compute Capacity (Quota, not tied to machine)

Getting Guaranteed Minimum Capacity (GROUP_QUOTA)

Getting As Many as Possible (Auto-Regroup)

Getting them As Soon as They Submit (typically One Negotiation Cycle)

Avoids Preemption

condor_status?
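
(Presumably this slide showed how to check the result; for example, per-submitter counts and accumulated usage can be inspected with:)

  condor_status -submitters
  condor_userprio -allusers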

It Works! (patched 6.8 and 6.9+) Code & Condor Community Process

Where do we go from here? What did we learn?

Wisconsin is working on making 6.9 Negotiation/Scheduling more efficient

In the Future: allow us to specify what we account for per VM/Slot (KFLOPS)?

That’s just me…

Come to tonight’s Reception. Participate in the Community.

Talk with the Condor Team. Talk with other users.

Help the community continue to work well for everyone.

Thank you. Questions? cyclecomputing.com