
1 Talking Points: Dynamic Extension of HTCondor pools
INFN HTCondor Workshop Oct 2016

2 Outline
Expand via Flocking
Grid Universe in HTCondor ("Condor-G")
Expand via "Glideins"
  Submission of a pilot to grid universe
Expand into Public Clouds

3 History: Always dynamic…
[Diagram: Central Manager (Frieda’s) running master, collector, negotiator, schedd, startd; Cluster Node running master, startd; Desktop running master, schedd, startd. Legend: process spawned; ClassAd communication pathway.]

4 Expand via Flocking

5 [Diagram labels: your workstation; personal Condor; 600 Condor jobs; Condor Pool; Friendly Condor Pool]

6 Friendly destination pool adds to condor_config:
    FLOCK_FROM = your.cm.edu
You add line to your condor_config:
    FLOCK_TO = Pool-Foo.edu, Pool-Bar.edu
[Diagram labels: Submit Machine with Schedd; Central Manager (CONDOR_HOST), Pool-Foo Central Manager, Pool-Bar Central Manager, each with Collector and Negotiator]

7 Condor Flocking
Remote pools are contacted in the order specified until jobs are satisfied
The list of remote pools is a property of the Schedd, not the Central Manager
  So different users can Flock to different pools
  And remote pools can allow specific users
User-priority system is “flocking-aware”
  A pool’s local users can have priority over remote users “flocking” in.
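
The "contacted in the order specified" rule can be illustrated with a toy model. This is only a sketch, not how the Schedd is implemented: the function name, the free-slot counts, and the assumption that the local pool is tried before any remote pool are all hypothetical, and real matchmaking goes through each pool's negotiator.

```python
from typing import Dict, List, Tuple


def place_jobs(idle_jobs: int, local_free: int,
               flock_to: List[str],
               remote_free: Dict[str, int]) -> List[Tuple[str, int]]:
    """Toy model: use the local pool first (assumption), then each
    remote pool in FLOCK_TO order until no idle jobs remain."""
    placements: List[Tuple[str, int]] = []
    remaining = idle_jobs
    # Candidate pools in the order they would be tried.
    candidates = [("local", local_free)]
    candidates += [(pool, remote_free.get(pool, 0)) for pool in flock_to]
    for pool, free in candidates:
        if remaining == 0:
            break
        take = min(remaining, free)
        if take > 0:
            placements.append((pool, take))
            remaining -= take
    if remaining > 0:
        # Jobs that no pool could take stay idle in the queue.
        placements.append(("idle", remaining))
    return placements


# Hypothetical numbers: 600 jobs, 100 free local slots.
flock_to = ["Pool-Foo.edu", "Pool-Bar.edu"]
result = place_jobs(600, 100, flock_to,
                    {"Pool-Foo.edu": 300, "Pool-Bar.edu": 400})
print(result)
```

Because the list belongs to the Schedd, a different submit machine could pass a different `flock_to` list to the same function and get a different placement.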

8 Condor Flocking, cont.
Flocking is a Condor-specific technology…
Frieda also has access to Globus resources she wants to use
  She has certificates and access to Globus gatekeepers at remote institutions
But Frieda wants Condor’s queue management features for her Globus jobs!
She installs Condor-G so she can submit “Globus Universe” jobs to Condor

9 Network Considerations
Can only flock to sites where nodes have outgoing network connectivity
  Incoming connectivity is not required if the remote site has enabled CCB
Want to support sending jobs to remote sites?
  Central manager and submit machines should be on public IP addresses
Want to enable mixed mode IPv4 / IPv6?
  Central manager and submit machines should be on dual-homed (IPv4 and IPv6 connected) machines

10 Grid Universe in HTCondor ("Condor-G")

11 Grid Universe (Condor-G)
Reliable, durable submission of a job to a remote scheduler
Popular way to send pilot jobs, key component of HTCondor-CE
Supports many “back end” types:
  HTCondor ("Condor-C")
  PBS
  LSF
  Grid Engine
  Google Compute Engine
  Amazon EC2
  OpenStack
  Cream
  NorduGrid ARC
  BOINC
  Globus: GT2, GT5
  UNICORE
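
As a rough illustration of what a grid universe job looks like, the sketch below renders a minimal submit description as text. The helper name, file names, and host names are hypothetical; the `universe`, `grid_resource`, `executable`, `output`, `error`, `log`, and `queue` submit commands are standard, but the exact `grid_resource` string depends on the back end and should be checked against the HTCondor manual.

```python
def grid_submit_description(executable: str, grid_resource: str) -> str:
    """Render a minimal grid universe submit description (sketch only)."""
    lines = [
        "universe = grid",
        # Back-end type followed by its back-end-specific arguments.
        f"grid_resource = {grid_resource}",
        f"executable = {executable}",
        "output = job.out",
        "error = job.err",
        "log = job.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


# Condor-C style example; both host names are placeholders.
text = grid_submit_description(
    "pilot.sh",
    "condor remote-schedd.example.org remote-cm.example.org",
)
print(text)
```

The resulting text would be saved to a file and handed to condor_submit; the local Schedd then keeps the job in its queue while the remote scheduler runs it.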

12 Add Grid Universe support for SLURM, OpenStack, Cobalt
Speak native SLURM protocol
  No need to install PBS compatibility package
Speak OpenStack’s NOVA protocol
  No need for EC2 compatibility layer
Speak to Cobalt Scheduler
  Argonne Leadership Computing Facilities
Jaime: Grid Jedi

13 One Solution: Condor-G GlideIn
Frieda needs a bigger Condor pool.
She can use the Grid Universe to run Condor daemons on remote clusters.
When the resources run these "GlideIn jobs", they will temporarily join her Condor Pool.

14 [Diagram labels: your workstation; personal Condor; jobs; glide-in jobs; Grid resources (PBS, LSF, Condor); Condor Pool; Friendly Condor Pool]

15 How It Works
[Diagram labels: Personal Condor; Schedd; jobs; Collector; Globus Resource; LSF]

16 How It Works
[Diagram labels: as slide 15, plus GlideIn jobs]

17 How It Works
[Diagram labels: as slide 16, plus GridManager]

18 How It Works
[Diagram labels: as slide 17, plus JobManager]

19 How It Works
[Diagram labels: as slide 18, plus Startd]

20 How It Works
[Diagram labels: same as slide 19]

21 How It Works
[Diagram labels: as slide 20, plus User Job]

22 GlideIn Factories
Examine queues of waiting jobs, submit grid universe jobs in response
Examples
  glideinWMS (CMS, OSG)
  AutoPyFactory (ATLAS)
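
The "examine queues, submit in response" loop can be sketched as a single decision function. This is a simplified illustration, not the logic of glideinWMS or AutoPyFactory: the function name, the `max_pilots` cap, and the one-job-per-pilot default are assumptions made for the example.

```python
from math import ceil


def pilots_to_submit(idle_jobs: int, pending_pilots: int,
                     running_pilots: int, max_pilots: int,
                     jobs_per_pilot: int = 1) -> int:
    """Decide how many new glidein (pilot) jobs to submit this cycle."""
    # Pilots needed to cover the jobs currently waiting in the queue.
    needed = ceil(idle_jobs / jobs_per_pilot)
    # Pilots already submitted but not yet running will serve some of them.
    shortfall = needed - pending_pilots
    # Never exceed the (hypothetical) cap on total pilots.
    headroom = max_pilots - pending_pilots - running_pilots
    return max(0, min(shortfall, headroom))


# 50 idle jobs, 10 pilots pending, 20 running, cap of 40 pilots.
print(pilots_to_submit(50, 10, 20, 40))
```

A factory would call something like this periodically and submit that many grid universe jobs, each of which starts HTCondor daemons on the remote resource.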

23 Expand into Public Clouds

24 Improved Scalability of Amazon EC2 grid jobs

25 Elastically grow your pool into the Cloud: condor_annex
Start virtual machines as HTCondor execute nodes in public clouds that join your pool
Leverage efficient AWS APIs such as Auto Scaling Groups and Spot Fleets
Secure mechanism for cloud instances to join the HTCondor pool at home institution
No matter what happens, no big money surprises!

26 Without condor_annex
+ Decide which type(s) of instances to use.
+ Pick a machine image, install HTCondor.
+ Configure HTCondor:
    to securely join the pool. (Coordinate with pool admin.)
    to shut down instance when not running a job (because of the long tail or a problem somewhere)
+ Decide on a bid for each instance type, according to its location (or pay more).
+ Configure the network and firewall at Amazon.
+ Implement a fail-safe in the form of a lease to make sure the pool does eventually shut itself off.
+ Automate response to being out-bid.
+ Monitor (for costs, for instances costing $ but not in pool)

27 With condor_annex
Goal: Simplified to a single command:
    condor_annex --annex-id 'TheNeeds-MooreLab' \
                 --expiry ' :59' \
                 --instances 1000

28 Cloud Elasticity at UW-Madison
[Diagram labels: local HTCondor scheduler; CHTC; OSG; Amazon; HTCondor annex daemon]

29 Questions?

30 More slides on Annex…

31 Bringing Cloud Elasticity to High-Throughput Scientific Applications

32 Cloud Elasticity at Work
[Diagram labels: HTCondor scheduler; Amazon (60k cores); Fermi (15k cores)]

33 ~60,000 cores from AWS
More than 16 million core-hours in production
[Chart: axis ticks at 50,000 and 20,000; remaining axis labels not recoverable]

34 Elasticity Steps
1 - Make spending decisions
2 - Prepare image(s)
3 - Provision instances
4 - Run jobs
5 - Monitor
6 - Shut down

35 Elasticity at UW-Madison
[Diagram labels: three local HTCondor schedulers; OSG; CHTC]

36 Motivating Example Dr. Needs-Moore needs more cycles in less time than she can get even by combining local, campus, and OSG resources. She decides she’s willing to spend some of her grant money to make this happen. She can’t spend her grant money on other people’s computation, so she needs her own “annex” in the cloud.

37 Cloud Elasticity at UW-Madison
[Diagram labels: local HTCondor scheduler; CHTC; OSG; Amazon; HTCondor annex daemon]

38 1 - Spending Decisions
Identify valuable workflows and assign a value and a deadline.
Policy enforcement:
  budget
  number of concurrent jobs
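
One way to picture the two policy limits is a function that caps the instance count by both budget and concurrency. This is an illustrative sketch only: the function, its parameters, and the flat hourly price are assumptions for the example, and money is kept in integer cents to avoid rounding surprises.

```python
def affordable_instances(budget_cents: int, price_cents_per_hour: int,
                         hours: int, max_concurrent: int) -> int:
    """Largest instance count allowed by both the budget and the
    concurrency limit, assuming a flat price for the whole run."""
    if price_cents_per_hour <= 0 or hours <= 0:
        raise ValueError("price and hours must be positive")
    # Cost of keeping one instance up for the whole run.
    cost_per_instance = price_cents_per_hour * hours
    by_budget = budget_cents // cost_per_instance
    return min(by_budget, max_concurrent)


# $1000 budget, 5 cents per instance-hour, 10 hours, at most 1000 instances.
print(affordable_instances(100_000, 5, 10, 1000))
```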

39 2 - Prepare Image(s)
Developers release “canonical” images.
Pool administrator adjusts one to suit.
Image set as default for pool’s users.
HTCondor configures the instances to join the pool and securely shares the required secret at runtime.

40 5 - Monitor
How much am I spending?
  How many instances have we started?
  How much does each one cost?
What am I gaining?
  How many instances have joined the pool? Which ones haven’t?
  Are those instances running jobs? If not, can we tell why?
  Are those jobs finishing?
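
The monitoring questions above boil down to comparing what the cloud is billing for with what the pool actually sees. The sketch below shows that comparison on plain data; the `Instance` record, its fields, and the input sets are hypothetical stand-ins for whatever the cloud provider and the collector would report.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Instance:
    instance_id: str
    hourly_cost_cents: int  # hypothetical price, in integer cents
    hours_up: int


def monitor(instances: List[Instance], in_pool: Set[str],
            running_jobs: Set[str]) -> Dict[str, object]:
    """Summarize spending and which instances are not pulling their weight."""
    started = {i.instance_id for i in instances}
    return {
        # How much am I spending?
        "spent_cents": sum(i.hourly_cost_cents * i.hours_up for i in instances),
        "started": len(instances),
        # Costing money but never joined the pool.
        "not_joined": sorted(started - in_pool),
        # Joined the pool but not running a job.
        "joined_but_idle": sorted((started & in_pool) - running_jobs),
    }


report = monitor(
    [Instance("a", 10, 2), Instance("b", 10, 2), Instance("c", 20, 1)],
    in_pool={"a", "b"},
    running_jobs={"a"},
)
print(report)
```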

41 6 - Shutdown
User specifies a lease.
HTCondor implements lease in the cloud.
Each instance configured to shut itself off if it has no work to do.
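
The two shutdown triggers, lease expiry and idleness, can be combined in a small check. This is a sketch under stated assumptions, not what condor_annex does internally: the function name, the `idle_limit` grace period, and the dates are all hypothetical.

```python
from datetime import datetime, timedelta


def should_shut_down(now: datetime, lease_expiry: datetime,
                     last_busy: datetime, idle_limit: timedelta) -> bool:
    """True if the lease has run out or the instance has been idle too long."""
    if now >= lease_expiry:
        # The lease is the fail-safe: it applies even if jobs are running.
        return True
    # Otherwise shut down only after a stretch with no work.
    return (now - last_busy) >= idle_limit


# Hypothetical lease and a 15-minute idle limit.
lease = datetime(2030, 1, 1, 12, 0)
idle_limit = timedelta(minutes=15)
print(should_shut_down(datetime(2030, 1, 1, 11, 0), lease,
                       datetime(2030, 1, 1, 10, 50), idle_limit))
```

Checking the lease first is a deliberate choice in this sketch: it keeps the fail-safe independent of whether the instance believes it still has work.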

42 Status
Elasticity demonstrated at medium scale. (Only thousand cores.)
Prototype of end-user tool developed. Demonstrated at HTCondor Week 2016.
Developing faster and more scalable mechanism for cloud provisioning.
Designing production tool for campus use.

43 tlmiller@cs.wisc.edu
[Diagram labels: local HTCondor scheduler; CHTC; OSG; Amazon; HTCondor annex daemon]

