Talking Points: Dynamic Extension of HTCondor Pools
INFN HTCondor Workshop, Oct 2016

Outline
- Expand via Flocking
- Grid Universe in HTCondor ("Condor-G")
- Expand via "Glideins"
  - Submission of a pilot to grid universe
- Expand into Public Clouds

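History: Always dynamic…
[Diagram: three kinds of machines in a pool. Central Manager (Frieda’s): master, collector, negotiator, schedd, startd. Cluster Node: master, startd. Desktop: master, schedd, startd. Legend: one arrow style means "Process Spawned", the other "ClassAd Communication Pathway".]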

Expand via Flocking

[Diagram: your workstation runs a personal Condor holding 600 Condor jobs, next to your Condor Pool and a Friendly Condor Pool.]

- Friendly destination pool adds to condor_config:
    FLOCK_FROM = your.cm.edu
- You add line to your condor_config:
    FLOCK_TO = Pool-Foo.edu, Pool-Bar.edu

[Diagram: the Schedd on the Submit Machine, and a Collector and Negotiator on each of three machines: your Central Manager (CONDOR_HOST), the Pool-Foo Central Manager, and the Pool-Bar Central Manager.]

Condor Flocking
- Remote pools are contacted in the order specified until jobs are satisfied
- The list of remote pools is a property of the Schedd, not the Central Manager
  - So different users can Flock to different pools
  - And remote pools can allow specific users
- User-priority system is “flocking-aware”
  - A pool’s local users can have priority over remote users “flocking” in.
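
The ordering rule above can be illustrated with a toy model. This is only a sketch and not HTCondor's matchmaking: the function `place_jobs`, the free-slot counts, and the idea of knowing free slots up front are all assumptions made for illustration. The pool names follow the FLOCK_TO example, and the local pool is listed first on the assumption that jobs flock only when the local pool cannot run them.

```python
def place_jobs(idle_jobs, free_slots_by_pool):
    """Toy model: fill pools in listed order until no idle jobs remain.

    free_slots_by_pool maps pool name -> free slots, in contact order
    (hypothetical numbers; a real schedd learns this through negotiation).
    Returns (placements, jobs_still_idle).
    """
    placements = {}
    remaining = idle_jobs
    for pool, free in free_slots_by_pool.items():
        if remaining == 0:
            break  # jobs are satisfied, so later pools are never contacted
        used = min(remaining, free)
        if used:
            placements[pool] = used
        remaining -= used
    return placements, remaining


# Local pool first, then the FLOCK_TO order from the example config.
pools = {"local": 250, "Pool-Foo.edu": 200, "Pool-Bar.edu": 500}
placed, still_idle = place_jobs(600, pools)
print(placed, still_idle)
```

With these made-up numbers, Pool-Bar.edu is used only for the jobs that the local pool and Pool-Foo.edu could not take.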

Condor Flocking, cont.
- Flocking is “Condor” specific technology…
- Frieda also has access to Globus resources she wants to use
  - She has certificates and access to Globus gatekeepers at remote institutions
- But Frieda wants Condor’s queue management features for her Globus jobs!
- She installs Condor-G so she can submit “Globus Universe” jobs to Condor

Network Considerations
- Can only flock to sites where nodes have outgoing network connectivity
  - Incoming not required if remote site enabled CCB
- Want to support sending jobs to remote sites?
  - Central manager and submit machines should be on public IP addresses
- Want to enable mixed mode IPv4 / IPv6?
  - Central manager and submit machines should be on dual-homed (IPv4 and IPv6 connected) machines
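
The first rule of thumb above can be written down as a small check. This is a sketch of the slide's wording only: the `RemoteSite` class and its field names are hypothetical, and nothing here probes a real network or reads HTCondor configuration.

```python
from dataclasses import dataclass


@dataclass
class RemoteSite:
    name: str
    nodes_have_outbound: bool   # execute nodes can open outgoing connections
    nodes_accept_inbound: bool  # execute nodes are reachable from outside
    ccb_enabled: bool           # the remote site has enabled CCB


def can_flock_to(site: RemoteSite) -> bool:
    """Encode the rule of thumb: outgoing is mandatory, incoming only without CCB."""
    if not site.nodes_have_outbound:
        return False
    return site.nodes_accept_inbound or site.ccb_enabled


print(can_flock_to(RemoteSite("Pool-Foo.edu", True, False, True)))
```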

Grid Universe in HTCondor ("Condor-G")

Grid Universe (Condor-G)
- Reliable, durable submission of a job to a remote scheduler
- Popular way to send pilot jobs, key component of HTCondor-CE
- Supports many “back end” types:
  - HTCondor ("Condor-C")
  - PBS
  - LSF
  - Grid Engine
  - Google Compute Engine
  - Amazon EC2
  - OpenStack
  - CREAM
  - NorduGrid ARC
  - BOINC
  - Globus: GT2, GT5
  - UNICORE
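
As a rough illustration of what a grid universe job looks like, the sketch below generates the text of a minimal submit description for two of the back ends listed above. The helper `grid_submit_description`, the host names, and the `pilot.sh` executable are hypothetical. Only the `universe`, `grid_resource`, `executable`, `log`, and `queue` submit commands are used, and a real job would normally need more (output and error files, credentials, and so on).

```python
def grid_submit_description(grid_resource, executable, log="job.log"):
    """Return the text of a minimal grid universe submit description."""
    lines = [
        "universe = grid",
        f"grid_resource = {grid_resource}",
        f"executable = {executable}",
        f"log = {log}",
        "queue",
    ]
    return "\n".join(lines) + "\n"


# Condor-C back end: "condor <remote schedd> <remote central manager>".
# Both host names are placeholders.
print(grid_submit_description(
    "condor schedd.pool-foo.example cm.pool-foo.example", "pilot.sh"))

# Local batch system back end, here PBS.
print(grid_submit_description("batch pbs", "pilot.sh"))
```

The returned text could be written to a file and handed to condor_submit.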

Add Grid Universe support for SLURM, OpenStack, Cobalt
- Speak native SLURM protocol
  - No need to install PBS compatibility package
- Speak OpenStack’s NOVA protocol
  - No need for EC2 compatibility layer
- Speak to Cobalt Scheduler
  - Argonne Leadership Computing Facilities
Jaime: Grid Jedi

One Solution: Condor-G GlideIn
- Frieda needs a bigger Condor pool.
- She can use the Grid Universe to run Condor daemons on remote clusters
- When the resources run these "GlideIn jobs", they will temporarily join her Condor Pool

[Diagram: your workstation runs a personal Condor holding your jobs; glide-in jobs are sent to grid resources (PBS, LSF, Condor) and to a Friendly Condor Pool, alongside your Condor Pool.]

How It Works
[Diagram, built up over several slides. Starting picture: a Personal Condor (Schedd, Collector) and a Globus Resource (LSF). Elements are then added in this order: jobs, GlideIn jobs, GridManager, JobManager, Startd, User Job.]

GlideIn Factories
- Examine queues of waiting jobs, submit grid universe jobs in response
- Examples
  - glideinWMS (CMS, OSG)
  - AutoPyFactory (ATLAS)
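
The "examine queues, submit in response" loop can be reduced to one decision per cycle. The sketch below is a hypothetical policy, not the logic of glideinWMS or AutoPyFactory: the function `pilots_to_submit`, its parameters, the per-cycle cap, and the assumption that one pilot serves one waiting job are all made up for illustration.

```python
def pilots_to_submit(idle_jobs, pilots_pending, pilots_running_idle,
                     max_pilots_per_cycle=10):
    """Decide how many new pilot (glidein) jobs to submit this cycle.

    idle_jobs:            user jobs waiting in the queue
    pilots_pending:       pilots submitted but not yet started remotely
    pilots_running_idle:  pilots that joined the pool and have no job yet
    """
    # Waiting jobs already covered by pilots that exist but are not busy.
    covered = pilots_pending + pilots_running_idle
    shortfall = max(0, idle_jobs - covered)
    # Cap the burst so one busy cycle cannot flood the remote scheduler.
    return min(shortfall, max_pilots_per_cycle)


print(pilots_to_submit(idle_jobs=25, pilots_pending=5, pilots_running_idle=0))
```

Each pilot returned by this function would itself be a grid universe job.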

Expand into Public Clouds

Improved Scalability of Amazon EC2 grid jobs

Elastically grow your pool into the Cloud: condor_annex
- Start virtual machines as HTCondor execute nodes in public clouds that join your pool
- Leverage efficient AWS APIs such as Auto Scaling Groups and Spot Fleets
- Secure mechanism for cloud instances to join the HTCondor pool at home institution
- No matter what happens, no big money surprises!

Without condor_annex
+ Decide which type(s) of instances to use.
+ Pick a machine image, install HTCondor.
+ Configure HTCondor:
    to securely join the pool. (Coordinate with pool admin.)
    to shut down instance when not running a job (because of the long tail or a problem somewhere)
+ Decide on a bid for each instance type, according to its location (or pay more).
+ Configure the network and firewall at Amazon.
+ Implement a fail-safe in the form of a lease to make sure the pool does eventually shut itself off.
+ Automate response to being out-bid.
+ Monitor (for costs, for instances costing $ but not in pool)
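
Two items in the list above, shutting down an instance that is not running a job and the lease fail-safe, amount to a small decision rule. The sketch below is one hypothetical way to express it. The function `should_shut_down`, its parameters, and the example times are assumptions; this is not how condor_annex implements its lease.

```python
from datetime import datetime, timedelta


def should_shut_down(now, lease_expiry, running_job, idle_since, max_idle):
    """Return True when a cloud instance should power itself off.

    Two independent triggers:
    - the lease has expired (fail-safe, fires even while a job is running);
    - the instance is not running a job and has been idle for max_idle.
    """
    if now >= lease_expiry:
        return True
    if running_job:
        return False
    return now - idle_since >= max_idle


# Example values, all made up.
now = datetime(2016, 10, 1, 12, 0)
print(should_shut_down(
    now,
    lease_expiry=now + timedelta(hours=1),
    running_job=False,
    idle_since=now - timedelta(minutes=30),
    max_idle=timedelta(minutes=20),
))
```

Checking the lease first is the design choice that matters: it bounds cost even if the job-tracking side of the setup fails.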

With condor_annex
Goal: Simplified to a single command:

    condor_annex --annex-id 'TheNeeds-MooreLab' \
        --expiry ' :59' \
        --instances 1000

Cloud Elasticity at UW-Madison
[Diagram labels: a user, an HTCondor scheduler, and an HTCondor annex daemon; resources shown are local, CHTC, OSG, and Amazon.]

Questions?
