HTCondor Annex (There are many clouds like it, but this one is mine.)

Slides:



Advertisements
Similar presentations
University of Notre Dame
Advertisements

Windows Deployment Services WDS for Large Scale Enterprises and Small IT Shops Presented By: Ryan Drown Systems Administrator for Krannert.
MCTS Guide to Microsoft Windows Server 2008 Network Infrastructure Configuration Chapter 8 Introduction to Printers in a Windows Server 2008 Network.
Operating Systems.
An Introduction to Cloud Computing. The challenge Add new services for your users quickly and cost effectively.
Hands-On Microsoft Windows Server 2008 Chapter 1 Introduction to Windows Server 2008.
Chapter-4 Windows 2000 Professional Win2K Professional provides a very usable interface and was designed for use in the desktop PC. Microsoft server system.
Hands-On Microsoft Windows Server 2008
INSTALLING MICROSOFT EXCHANGE SERVER 2003 CLUSTERS AND FRONT-END AND BACK ‑ END SERVERS Chapter 4.
Module 7: Fundamentals of Administering Windows Server 2008.
Ian Alderman A Little History…
Chapter 10 Chapter 10: Managing the Distributed File System, Disk Quotas, and Software Installation.
Core 3: Communication Systems. Network software includes the Network Operating Software (NOS) and also network based applications such as those running.
Virtualization Technology and Microsoft Virtual PC 2007 YOU ARE WELCOME By : Osama Tamimi.
SQL SERVER 2008 Installation Guide A Step by Step Guide Prepared by Hassan Tariq.
Install, configure and test ICT Networks
Launch Amazon Instance. Amazon EC2 Amazon Elastic Compute Cloud (Amazon EC2) provides resizable computing capacity in the Amazon Web Services (AWS) cloud.
Claudio Grandi INFN Bologna Virtual Pools for Interactive Analysis and Software Development through an Integrated Cloud Environment Claudio Grandi (INFN.
IPEmotion License Management PM (V1.2).
© 2015 MetricStream, Inc. All Rights Reserved. AWS server provisioning © 2015 MetricStream, Inc. All Rights Reserved. By, Srikanth K & Rohit.
SEMINAR ON.  OVERVIEW -  What is Cloud Computing???  Amazon Elastic Cloud Computing (Amazon EC2)  Amazon EC2 Core Concept  How to use Amazon EC2.
Linux & UNIX OS Overview Fort Collins, CO Copyright © XTR Systems, LLC Overview of the Linux & UNIX Operating Systems Instructor: Joseph DiVerdi, Ph.D.,
Working with Windows 7 at CERN
Network customization
Chapter 7: Using Windows Servers
Unit 3 Virtualization.
Chapter 6: Securing the Cloud
Cloud Computing Q&A Presented by:
Resource Management IB Computer Science.
Processes and threads.
Guide to Linux Installation and Administration, 2e
AWS Integration in Distributed Computing
HTCondor Security Basics
Outline Expand via Flocking Grid Universe in HTCondor ("Condor-G")
Introduction to Operating Systems
Mechanism: Limited Direct Execution
Outline What does the OS protect? Authentication for operating systems
High Availability in HTCondor
Tools and Services Workshop Overview of Atmosphere
THE STEPS TO MANAGE THE GRID
Introduction to Networking
Introduction to Computers
Introduction to Computers
Creating a Windows Server 2012 R2 Datacenter Virtual machine
Creating a Windows Server 2016 Datacenter Virtual machine
Outline What does the OS protect? Authentication for operating systems
Introduction to Computers
Introduction to Cloud Computing
Main Memory Management
Grid Means Business OGF-20, Manchester, May 2007
DHCP, DNS, Client Connection, Assignment 1 1.3
Dev Test on Windows Azure Solution in a Box
Managing Clouds with VMM
HDFS on Kubernetes -- Lessons Learned
HTCondor Security Basics HTCondor Week, Madison 2016
Fault Tolerance Distributed Web-based Systems
HDFS on Kubernetes -- Lessons Learned
IBM Power Systems.
Chapter 2: Operating-System Structures
Prof. Leonardo Mostarda University of Camerino
MAINTAINING SERVER AVAILIBILITY
16. Account Monitoring and Control
Database System Architectures
Network customization
Chapter 2: Operating-System Structures
Mechanism: Address Translation
Lecture Topics: 11/1 Hand back midterms
Setting up PostgreSQL for Production in AWS
OU BATTLECARD: Oracle Systems Learning Subscription
Chapter 13: I/O Systems “The two main jobs of a computer are I/O and [CPU] processing. In many cases, the main job is I/O, and the [CPU] processing is.
Presentation transcript:

HTCondor Annex (There are many clouds like it, but this one is mine.)

Annex means (an) Addition An annex is “a building joined to main building, providing additional space or accommodations” An HTCondor annex could provide: more machines specialized hardware specialized policies Use condor_annex to acquire computational resources from the cloud

What is the cloud? Commercial services which rent computational resources by the hour They own the hardware You provide the software (“disk image”) (OS, applications, configuration, maybe data) You can configure the networking and storage as well A virtual machine – if I say that instead of “computational resource” – is a way to time-share hardware such that nobody can tell they don’t have a full machine.

Why not keep using the Grid? Cloud resources are typically available sooner and in greater quantity Cloud resources are more customizable (networking, software, configuration/policy, etc)

Intended for Users The condor_annex tool was first released this year, in HTCondor 8.7.0 Improved in 8.7.1 and still under active development To add a GPU to the pool: condor_annex -count 1 \ -annex-name ToddsGPU \ -aws-on-demand-instance-type p2.xlarge

Use Case 1: Deadlines How important is that user’s deadline? Is she willing to spend money on it? Make it easy for the user to run jobs in the cloud, trading money for job completion automation sane defaults admin configuration Be sure to say right away that this use cases addresses users directly.

Use Case 2: Capability Meet intermittent needs for hardware with lots (TBs) of memory with GPUs with fast local storage of shared data especially if one of the AWS public data sets Special job policies, like long runtimes This use case is for both users and admins. The AWS public data sets include, for example, Landsat, the Sloan digital sky survey, or the 1000 genomes project.

Use Case 3: Capacity Lower costs through higher utilization, with cloud rentals covering usage bursts Without condor_annex, expanding an HTCondor pool into the cloud isn’t easy This use case is for admins.

A brief overview of the Annex lifecycle

Annex Lifecycle User requests resources Then condor_annex starts resources Resources join pool Resources stop spending money If they become idle After the pre-selected duration

1. Request Resources User requests may specify: Two types of resource hardware (CPUs, memory, disk, GPUs) software (OS, applications, configuration, data) number of resources and maximum lifetime Two types of resource on-demand: pricier, yours until you stop them spot: cheaper, can be lost to a higher bidder after a two-minute warning only suitable for short or resumable jobs Generally, we’d like to be able to have the HTCondor admin tweak the default for their users. For instance, if all of your pool’s machines have R installed, you might want to modify the annex default image to include R.

(An aside: Spot Fleet) Amazon offers, and condor_annex supports, a mechanism called “Spot Fleet” A “Spot Fleet” automatically chooses the cheapest way to satisfy spot resource requests which aren’t picky about their hardware requirements

2. Start Resources condor_annex machinery starts each instance a “client token” (intended for fault tolerance) is used to indelibly mark each resource as part of a particular annex credentials for joining you pool are securely communicated instance “user data” is left available for your use The “role” is one of the details handled by the set-up instructions.

3. Resource Securely Joins Pool Resource boots up Credentials are securely retrieved and written to disk HTCondor starts up, reports to your central manager condor_status -annex ToddsGPU

4. Resources Stop Spending Fail-safe: the resources always stop Even if the user’s machine goes offline Implemented entirely in the cloud (Uses AWS Lambda and CloudWatch Events) Checks the duration every five minutes (Uses “client token” to identify annex instances) condor_off -master -annex ToddsGPU The initial set-up instructions create the Lambda function; the per-annex timer is set up by condor_annex.

Opportunities for Improvement Only works with Amazon Can’t change annex duration without adding nodes Requires admin help to run jobs from an existing pool

Customization

Disk Image Customization A resource must have a disk image (OS, applications, configuration, maybe data) HTCondor provides a default disk image that should work for most users If you create disk images for your users, you can copy and customize the default image for them, or make your own from scratch, subject to a few restrictions We’d like to have condor_annex automatically pick up on its pool’s default image, but that’s at least another version away.

Disk Image Requirements The default disk image does all this Start-up to fetch config and security data currently requires AWS CLI tool HTCondor configured to turn off when it’s idle for too long. STARTD_NOCLAIM_SHUTDOWN HTCondor configured to turn instance off when the master exits. DEFAULT_MASTER_SHUTDOWN_SCRIPT So that’s not a whole lot that’s required; you have a great deal of freedom in making an annex-compatible image.

Image Suggestions The default disk image does all this Advertise instance ID in master and startd Use public IP addresses and set TCP_FORWARDING_HOST Turn communications integrity and encryption on Encrypt the run directories

What can you do today?

Initial Set-Up Follow the initial set-up instructions to connect condor_annex to an AWS account via HTCondor configuration Assumptions (mostly for simplicity): new, private HTCondor pool public IP address, open port Linux https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p= UsingCondorAnnexForTheFirstTimeEightSevenOne