Talking Points: Dynamic Extension of HTCondor Pools
INFN HTCondor Workshop, Oct 2016

Outline
- Expand via Flocking
- Grid Universe in HTCondor ("Condor-G")
- Expand via "Glideins": submission of a pilot to the grid universe
- Expand into Public Clouds

History: Always dynamic…
[Diagram: Frieda's pool. The Central Manager runs master, collector, negotiator, schedd, and startd daemons; cluster nodes run master and startd; desktops run master, schedd, and startd. The legend distinguishes spawned processes from ClassAd communication pathways.]

Expand via Flocking

[Diagram: your workstation runs a personal Condor with 600 Condor jobs; its Condor pool flocks to a friendly Condor pool.]

Flocking configuration: you add a line to your condor_config, FLOCK_TO = Pool-Foo.edu, Pool-Bar.edu, and the friendly destination pool adds FLOCK_FROM = your.cm.edu to its condor_config. [Diagram: the schedd on your submit machine talks to your own central manager (CONDOR_HOST) and to the Pool-Foo and Pool-Bar central managers, each running a collector and negotiator.]
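
As a concrete sketch of those two lines (the pool and host names are the slide's own placeholders), the two condor_config files might contain:

  # On your submit machine (and its central manager):
  FLOCK_TO = Pool-Foo.edu, Pool-Bar.edu

  # On the friendly destination pool's central manager:
  FLOCK_FROM = your.cm.edu

In practice the destination pool also has to authorize the remote schedd in its ALLOW_* security settings; the exact lines depend on that pool's security configuration.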

Condor Flocking
- Remote pools are contacted in the order specified until jobs are satisfied.
- The list of remote pools is a property of the Schedd, not the Central Manager:
  - so different users can flock to different pools,
  - and remote pools can allow specific users.
- The user-priority system is "flocking-aware": a pool's local users can have priority over remote users "flocking" in.

Condor Flocking, cont.
- Flocking is Condor-specific technology…
- Frieda also has access to Globus resources she wants to use:
  - she has certificates and access to Globus gatekeepers at remote institutions,
  - but Frieda wants Condor's queue-management features for her Globus jobs!
- She installs Condor-G so she can submit "Globus Universe" jobs to Condor.

Network Considerations
- You can only flock to sites where nodes have outgoing network connectivity; incoming connectivity is not required if the remote site has enabled CCB.
- Want to support sending jobs to remote sites? The central manager and submit machines should be on public IP addresses.
- Want to enable mixed-mode IPv4/IPv6? The central manager and submit machines should be dual-homed (connected to both IPv4 and IPv6).
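
A minimal, illustrative configuration for the two cases above (the values are examples, not taken from the slides):

  # On execute nodes without inbound connectivity: register with a CCB
  # server, commonly the pool's collector, so the schedd can still reach
  # them over a connection the node itself opened.
  CCB_ADDRESS = $(COLLECTOR_HOST)

  # On central manager and submit machines in a mixed-mode pool:
  ENABLE_IPV4 = TRUE
  ENABLE_IPV6 = TRUE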

Grid Universe in HTCondor ("Condor-G")

Grid Universe (Condor-G)
- Reliable, durable submission of a job to a remote scheduler
- Popular way to send pilot jobs; a key component of HTCondor-CE
- Supports many "back end" types: HTCondor ("Condor-C"), PBS, LSF, Grid Engine, Google Compute Engine, Amazon EC2, OpenStack, CREAM, NorduGrid ARC, BOINC, Globus (GT2, GT5), UNICORE
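
For illustration only, a grid universe submit file targeting a remote HTCondor schedd (the "Condor-C" back end) could look roughly like this; the host names are hypothetical:

  universe      = grid
  # grid_resource = condor <remote schedd name> <remote central manager>
  grid_resource = condor ce.example.edu cm.example.edu
  executable    = analyze.sh
  output        = job.out
  error         = job.err
  log           = job.log
  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT
  queue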

Add Grid Universe support for SLURM, OpenStack, Cobalt
- Speak the native SLURM protocol: no need to install the PBS compatibility package
- Speak OpenStack's NOVA protocol: no need for the EC2 compatibility layer
- Speak to the Cobalt scheduler (Argonne Leadership Computing Facility)
Jaime: Grid Jedi
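
With the native support, essentially only the grid_resource line changes; a hedged sketch for a locally reachable SLURM cluster:

  universe      = grid
  # Other batch systems ("batch pbs", "batch sge", ...) follow the same
  # pattern; a remote cluster can be named as "batch slurm user@host".
  grid_resource = batch slurm
  executable    = analyze.sh
  queue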

One Solution: Condor-G GlideIn
- Frieda needs a bigger Condor pool.
- She can use the Grid Universe to run Condor daemons on remote clusters.
- When the resources run these "GlideIn jobs", they will temporarily join her Condor pool.
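
At its core a GlideIn is just a grid universe job whose payload starts a condor_master/condor_startd configured to report back to Frieda's collector. A rough sketch, in which glidein_startup.sh, the tarball, and all host names are hypothetical stand-ins rather than anything from the slides:

  universe      = grid
  grid_resource = condor remote-schedd.example.edu remote-cm.example.edu
  # The script unpacks a Condor distribution and starts a startd with
  # CONDOR_HOST pointing back at Frieda's pool.
  executable    = glidein_startup.sh
  arguments     = --collector frieda-cm.example.edu:9618
  transfer_input_files    = condor_glidein.tar.gz
  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT
  queue 10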

[Diagram, before and after: Frieda's workstation runs a personal Condor sending jobs directly to grid resources (PBS, LSF, Condor); with glide-in jobs, those same grid resources temporarily join her Condor pool alongside a friendly Condor pool.]

How It Works
[Animated diagram sequence: Frieda's Personal Condor (Schedd plus Collector) holds her jobs and a GlideIn job. The Schedd spawns a GridManager, which submits the GlideIn job to the Globus Resource's JobManager; the JobManager hands it to LSF, which starts a Condor Startd on a worker node. The Startd advertises itself back to Frieda's Collector, and her user jobs then run on it.]

GlideIn Factories
- Examine queues of waiting jobs and submit grid universe jobs in response
- Examples: glideinWMS (CMS, OSG), AutoPyFactory (ATLAS)

Expand into Public Clouds

Improved Scalability of Amazon EC2 grid jobs

Elastically grow your pool into the Cloud: condor_annex
- Start virtual machines as HTCondor execute nodes in public clouds that join your pool
- Leverage efficient AWS APIs such as Auto Scaling Groups and Spot Fleets
- Secure mechanism for cloud instances to join the HTCondor pool at the home institution
- No matter what happens, no big money surprises!

Without condor_annex
+ Decide which type(s) of instances to use.
+ Pick a machine image, install HTCondor.
+ Configure HTCondor:
  - to securely join the pool (coordinate with the pool admin),
  - to shut down the instance when not running a job (because of the long tail, or a problem somewhere).
+ Decide on a bid for each instance type, according to its location (or pay more).
+ Configure the network and firewall at Amazon.
+ Implement a fail-safe in the form of a lease to make sure the pool does eventually shut itself off.
+ Automate the response to being out-bid.
+ Monitor (for costs, and for instances costing $ but not in the pool).

With condor_annex
Goal: simplified to a single command:

  condor_annex --annex-id 'TheNeeds-MooreLab' \
               --expiry '2015-12-18 23:59' \
               --instances 1000

Cloud Elasticity at UW-Madison
[Diagram: jobs from the local HTCondor scheduler at CHTC run both on OSG and on Amazon instances provisioned by the HTCondor annex daemon.]

Questions?

More slides on Annex…

Bringing Cloud Elasticity to High-Throughput Scientific Applications

Cloud Elasticity at Work
[Diagram: a single HTCondor scheduler running jobs on Amazon (60k cores) and on Fermi (15k cores).]

~60,000 cores from AWS; more than 16 million core-hours in production.
[Plot: AWS core count over time, 1/23 through 2/9.]

Elasticity Steps
1 - Make spending decisions
2 - Prepare image(s)
3 - Provision instances
4 - Run jobs
5 - Monitor
6 - Shut down

Elasticity at UW-Madison
[Diagram: local HTCondor schedulers at CHTC running jobs locally and on OSG.]

Motivating Example Dr. Needs-Moore needs more cycles in less time than she can get even by combining local, campus, and OSG resources. She decides she’s willing to spend some of her grant money to make this happen. She can’t spend her grant money on other people’s computation, so she needs her own “annex” in the cloud.

Cloud Elasticity at UW-Madison
[Diagram, repeated: jobs from the local HTCondor scheduler at CHTC run both on OSG and on Amazon instances provisioned by the HTCondor annex daemon.]

1 - Spending Decisions
- Identify valuable workflows and assign a value and a deadline.
- Policy enforcement: budget the number of concurrent jobs.
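
One stock HTCondor mechanism that can enforce such a budget is a concurrency limit; the limit name and value below are invented for illustration:

  # Central manager (negotiator) configuration: at most 1000 jobs that
  # declare the "annex" limit may run at once.
  ANNEX_LIMIT = 1000

  # In each submit file that should count against the budget:
  #   concurrency_limits = annex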

2 - Prepare Image(s)
- Developers release "canonical" images.
- The pool administrator adjusts one to suit and sets it as the default for the pool's users.
- HTCondor configures the instances to join the pool and securely shares the required secret at runtime.
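
Conceptually, the image's HTCondor configuration only needs to say where the pool is and how to authenticate; a simplified sketch using pool-password authentication (condor_annex's actual generated configuration differs in detail, and the host name is hypothetical):

  # Execute-node configuration injected into the cloud instance.
  CONDOR_HOST = cm.chtc.example.edu
  DAEMON_LIST = MASTER, STARTD

  # The shared secret is delivered at instance start-up, not baked into
  # the public image.
  SEC_PASSWORD_FILE = /etc/condor/pool_password
  SEC_DEFAULT_AUTHENTICATION = REQUIRED
  SEC_DEFAULT_AUTHENTICATION_METHODS = PASSWORD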

5 - Monitor
- How much am I spending?
  - How many instances have we started?
  - How much does each one cost?
- What am I gaining?
  - How many instances have joined the pool? Which ones haven't?
  - Are those instances running jobs? If not, can we tell why?
  - Are those jobs finishing?

6 - Shutdown
- The user specifies a lease; HTCondor implements the lease in the cloud.
- Each instance is configured to shut itself off if it has no work to do.
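
The "shut itself off when idle" behavior maps onto an existing startd knob; the value below is an arbitrary example (the lease itself is enforced on the cloud side):

  # The startd exits if it goes this many seconds without being claimed,
  # e.g. during the long tail when no more work is arriving; the instance
  # can then terminate itself.
  STARTD_NOCLAIM_SHUTDOWN = 1200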

Status
- Elasticity demonstrated at medium scale (only 50-60 thousand cores).
- Prototype of the end-user tool developed; demonstrated at HTCondor Week 2016.
- Developing a faster and more scalable mechanism for cloud provisioning.
- Designing a production tool for campus use.

tlmiller@cs.wisc.edu
[Diagram, repeated: jobs from the local HTCondor scheduler at CHTC run on OSG and on Amazon instances provisioned by the HTCondor annex daemon.]