Talking Points: Dynamic Extension of HTCondor Pools
INFN HTCondor Workshop, Oct 2016
Outline
- Expand via Flocking
- Grid Universe in HTCondor ("Condor-G")
- Expand via "Glideins"
  - Submission of a pilot to grid universe
- Expand into Public Clouds
History: Always dynamic…
[Diagram: daemons running on each machine in a pool]
- Central Manager (Frieda’s): master, collector, negotiator, schedd, startd
- Cluster Node: master, startd
- Desktop: master, schedd, startd
- Legend: Process Spawned; ClassAd Communication Pathway
Expand via Flocking
[Diagram: your workstation (personal Condor, 600 Condor jobs), Condor Pool, Friendly Condor Pool]
[Diagram: Submit Machine (Schedd); Central Manager (CONDOR_HOST), Pool-Foo Central Manager, and Pool-Bar Central Manager, each with a Collector and Negotiator]
You add a line to your condor_config:
    FLOCK_TO = Pool-Foo.edu, Pool-Bar.edu
The friendly destination pool adds to its condor_config:
    FLOCK_FROM = your.cm.edu
Condor Flocking
- Remote pools are contacted in the order specified until jobs are satisfied
- The list of remote pools is a property of the Schedd, not the Central Manager
  - So different users can Flock to different pools
  - And remote pools can allow specific users
- User-priority system is “flocking-aware”
  - A pool’s local users can have priority over remote users “flocking” in.
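The contact order described above can be illustrated with a toy model. This is a minimal sketch, not how the schedd actually negotiates: the function name `assign_idle_jobs` and the free-slot counts are hypothetical, and real flocking also depends on negotiation cycles, user priorities, and each pool's FLOCK_FROM policy. It only shows idle jobs spilling over to pools in list order until they are satisfied.

```python
from typing import Dict, List, Tuple


def assign_idle_jobs(idle_jobs: int,
                     pools: List[Tuple[str, int]]) -> Tuple[Dict[str, int], int]:
    """Place idle jobs on pools in the order given (local pool first).

    pools is a list of (pool name, free slots) pairs; the free-slot
    numbers are made up for illustration.
    Returns ({pool name: jobs placed}, jobs still idle).
    """
    placed: Dict[str, int] = {}
    remaining = idle_jobs
    for name, free_slots in pools:
        if remaining == 0:
            break  # jobs are satisfied, so later pools are never contacted
        take = min(remaining, free_slots)
        if take > 0:
            placed[name] = take
        remaining -= take
    return placed, remaining


if __name__ == "__main__":
    # The local pool comes first, then the pools in FLOCK_TO order.
    pools = [("local", 100), ("Pool-Foo.edu", 300), ("Pool-Bar.edu", 500)]
    print(assign_idle_jobs(600, pools))
```

With these made-up slot counts, Pool-Bar.edu is only contacted because the local pool and Pool-Foo.edu together cannot hold all 600 jobs.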
Condor Flocking, cont.
- Flocking is “Condor” specific technology…
- Frieda also has access to Globus resources she wants to use
  - She has certificates and access to Globus gatekeepers at remote institutions
- But Frieda wants Condor’s queue management features for her Globus jobs!
- She installs Condor-G so she can submit “Globus Universe” jobs to Condor
Network Considerations
- Can only flock to sites where nodes have outgoing network connectivity
  - Incoming not required if the remote site has enabled CCB
- Want to support sending jobs to remote sites?
  - Central manager and submit machines should be on public IP addresses
- Want to enable mixed mode IPv4 / IPv6?
  - Central manager and submit machines should be on dual-homed (IPv4 and IPv6 connected) machines
Grid Universe in HTCondor ("Condor-G")
Grid Universe (Condor-G)
- Reliable, durable submission of a job to a remote scheduler
- Popular way to send pilot jobs, key component of HTCondor-CE
- Supports many “back end” types:
  - HTCondor ("Condor-C")
  - PBS
  - LSF
  - Grid Engine
  - Google Compute Engine
  - Amazon EC2
  - OpenStack
  - CREAM
  - NorduGrid ARC
  - BOINC
  - Globus: GT2, GT5
  - UNICORE
Add Grid Universe support for SLURM, OpenStack, Cobalt
- Speak native SLURM protocol
  - No need to install PBS compatibility package
- Speak OpenStack’s NOVA protocol
  - No need for EC2 compatibility layer
- Speak to Cobalt Scheduler
  - Argonne Leadership Computing Facilities
Jaime: Grid Jedi
One Solution: Condor-G GlideIn
- Frieda needs a bigger Condor pool.
- She can use the Grid Universe to run Condor daemons on remote clusters
- When the resources run these "GlideIn jobs", they will temporarily join her Condor Pool
[Diagram, two builds: your workstation (personal Condor) sends jobs to Grid resources (LSF, PBS, Condor); then glide-in jobs on those Grid resources appear alongside the Condor Pool and the Friendly Condor Pool]
How It Works
[Diagram, built up over several slides: a Personal Condor beside a Globus Resource. Labels appear in this order: Schedd, Collector, LSF, and jobs; GlideIn jobs; GridManager; JobManager; Startd; User Job.]
GlideIn Factories
- Examine queues of waiting jobs, submit grid universe jobs in response
- Examples
  - glideinWMS (CMS, OSG)
  - AutoPyFactory (ATLAS)
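The factory idea above, looking at waiting jobs and submitting pilots in response, can be sketched as a single decision function. This is a minimal illustration under stated assumptions, not how glideinWMS or AutoPyFactory decide: the name `pilots_to_submit`, its parameters, and the `max_pending` cap are all hypothetical.

```python
def pilots_to_submit(idle_jobs: int,
                     pilots_pending: int,
                     pilots_unclaimed: int,
                     max_pending: int = 50) -> int:
    """Return how many new pilot (glidein) jobs to submit this cycle.

    idle_jobs:        user jobs waiting in the queue
    pilots_pending:   pilots submitted to the grid but not yet running
    pilots_unclaimed: pilots that joined the pool and have no job yet
    max_pending:      hypothetical cap on pilots waiting at the remote site
    """
    # Demand that no existing pilot is already positioned to serve.
    uncovered = idle_jobs - pilots_pending - pilots_unclaimed
    # Never queue more than max_pending pilots at the remote scheduler.
    headroom = max_pending - pilots_pending
    return max(0, min(uncovered, headroom))


if __name__ == "__main__":
    print(pilots_to_submit(idle_jobs=600, pilots_pending=10, pilots_unclaimed=5))
```

Capping the number of pending pilots is one simple way to avoid flooding a remote queue when demand spikes; a real factory would apply per-site limits and also remove pilots that are no longer needed.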
Expand into Public Clouds
Improved Scalability of Amazon EC2 grid jobs
Elastically grow your pool into the Cloud: condor_annex
- Start virtual machines as HTCondor execute nodes in public clouds that join your pool
- Leverage efficient AWS APIs such as Auto Scaling Groups and Spot Fleets
- Secure mechanism for cloud instances to join the HTCondor pool at home institution
- No matter what happens, no big money surprises!
Without condor_annex
+ Decide which type(s) of instances to use.
+ Pick a machine image, install HTCondor.
+ Configure HTCondor:
    to securely join the pool. (Coordinate with pool admin.)
    to shut down instance when not running a job (because of the long tail or a problem somewhere)
+ Decide on a bid for each instance type, according to its location (or pay more).
+ Configure the network and firewall at Amazon.
+ Implement a fail-safe in the form of a lease to make sure the pool does eventually shut itself off.
+ Automate response to being out-bid.
+ Monitor (for costs, for instances costing $ but not in pool)
With condor_annex
Goal: simplified to a single command:
    condor_annex --annex-id 'TheNeeds-MooreLab' \
                 --expiry '2015-12-18 23:59' \
                 --instances 1000
Cloud Elasticity at UW-Madison
[Diagram: local HTCondor scheduler, OSG, CHTC, Amazon, HTCondor annex daemon]
Questions?
More slides on Annex…
Bringing Cloud Elasticity to High-Throughput Scientific Applications
Cloud Elasticity at Work
[Diagram: HTCondor scheduler, Amazon (60k cores), Fermi (15k cores)]
~60,000 cores from AWS
More than 16 million core-hours in production
[Plot: y-axis ticks at 20,000 and 50,000; x-axis dates 1/23, 1/26, 1/28, 2/9]
Elasticity Steps
1 - Make spending decisions
2 - Prepare image(s)
3 - Provision instances
4 - Run jobs
5 - Monitor
6 - Shut down
Elasticity at UW-Madison
[Diagram, three builds: local HTCondor scheduler, OSG, CHTC]
Motivating Example
Dr. Needs-Moore needs more cycles in less time than she can get even by combining local, campus, and OSG resources. She decides she’s willing to spend some of her grant money to make this happen. She can’t spend her grant money on other people’s computation, so she needs her own “annex” in the cloud.
Cloud Elasticity at UW-Madison
[Diagram: local HTCondor scheduler, OSG, CHTC, Amazon, HTCondor annex daemon]
1 - Spending Decisions
- Identify valuable workflows and assign a value and a deadline.
- Policy enforcement:
  - budget
  - number of concurrent jobs
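The two policy limits named above, a budget and a number of concurrent jobs, can be combined into one cap on how many instances to request. This is a minimal sketch under simplifying assumptions: a fixed hourly price (real spot prices fluctuate), every instance running until the deadline, and one job per instance. The function name `max_instances` and its parameters are hypothetical and are not part of condor_annex.

```python
def max_instances(budget_cents: int,
                  price_cents_per_hour: int,
                  hours_until_deadline: int,
                  max_concurrent_jobs: int) -> int:
    """Largest instance count that respects both policy limits.

    Money is kept in integer cents to avoid floating-point rounding.
    Assumes a fixed hourly price and that every instance runs for the
    whole period up to the deadline.
    """
    if price_cents_per_hour <= 0 or hours_until_deadline <= 0:
        return 0
    cost_per_instance = price_cents_per_hour * hours_until_deadline
    affordable = budget_cents // cost_per_instance
    # Assuming one job per instance, more instances than the
    # concurrent-job limit would sit idle while still costing money.
    return min(affordable, max_concurrent_jobs)


if __name__ == "__main__":
    # $1,000 budget, $0.10 per instance-hour, 24 hours to the deadline.
    print(max_instances(100_000, 10, 24, max_concurrent_jobs=1000))
```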
2 - Prepare Image(s)
- Developers release “canonical” images.
- Pool administrator adjusts one to suit.
- Image set as default for pool’s users.
- HTCondor configures the instances to join the pool and securely shares the required secret at runtime.
5 - Monitor
- How much am I spending?
  - How many instances have we started?
  - How much does each one cost?
- What am I gaining?
  - How many instances have joined the pool? Which ones haven’t?
  - Are those instances running jobs? If not, can we tell why?
  - Are those jobs finishing?
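The monitoring questions above amount to comparing what the cloud provider is billing for against what the pool can see. The sketch below is illustrative only: the `Instance` record, the `monitor` function, and the instance IDs are hypothetical, and a real tool would pull this data from the cloud provider's API and from the pool's collector.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Instance:
    instance_id: str
    hourly_price: float   # dollars per hour, as billed by the provider
    hours_running: float


def monitor(started: List[Instance],
            joined_ids: Set[str],
            busy_ids: Set[str]) -> Dict[str, object]:
    """Summarise spending and find instances that cost money but do no work.

    started:    every instance the cloud provider reports as running
    joined_ids: IDs of instances that are visible in the pool
    busy_ids:   IDs of instances currently running a job
    """
    spend = sum(i.hourly_price * i.hours_running for i in started)
    not_in_pool = [i.instance_id for i in started
                   if i.instance_id not in joined_ids]
    idle = [i.instance_id for i in started
            if i.instance_id in joined_ids and i.instance_id not in busy_ids]
    return {"spend": spend, "not_in_pool": not_in_pool, "idle": idle}


if __name__ == "__main__":
    started = [Instance("i-a", 0.5, 2.0),
               Instance("i-b", 0.25, 4.0),
               Instance("i-c", 0.5, 1.0)]
    print(monitor(started, joined_ids={"i-a", "i-b"}, busy_ids={"i-a"}))
```

Instances listed under `not_in_pool` are the ones "costing $ but not in pool" and are the first candidates for investigation or termination.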
6 - Shutdown
- User specifies a lease.
- HTCondor implements lease in the cloud.
- Each instance configured to shut itself off if it has no work to do.
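The two shutdown rules above, a hard lease plus an idle shut-off, can be expressed as one check. This is a minimal sketch, not the mechanism HTCondor uses in the cloud: the function name `should_shut_down` and the 20-minute idle limit are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_shut_down(now: datetime,
                     lease_expiry: datetime,
                     idle_since: Optional[datetime],
                     max_idle: timedelta = timedelta(minutes=20)) -> bool:
    """Decide whether an instance should turn itself off.

    lease_expiry: the user-specified lease; a hard stop that applies
                  even if a job is still running
    idle_since:   when the instance last became idle, or None while it
                  is running a job
    max_idle:     hypothetical grace period before an idle instance stops
    """
    if now >= lease_expiry:
        return True
    if idle_since is not None and now - idle_since >= max_idle:
        return True
    return False


if __name__ == "__main__":
    expiry = datetime(2015, 12, 18, 23, 59)
    # Idle for 30 minutes, well before the lease expires.
    print(should_shut_down(datetime(2015, 12, 18, 12, 0), expiry,
                           idle_since=datetime(2015, 12, 18, 11, 30)))
```

Checking the lease first makes it the fail-safe: it still fires when the idle test cannot, for example because a stuck job keeps the instance looking busy.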
Status
- Elasticity demonstrated at medium scale. (Only 50-60 thousand cores.)
- Prototype of end-user tool developed.
  - Demonstrated at HTCondor Week 2016.
- Developing faster and more scalable mechanism for cloud provisioning.
- Designing production tool for campus use.
tlmiller@cs.wisc.edu
[Diagram: local HTCondor scheduler, OSG, CHTC, Amazon, HTCondor annex daemon]