Talking Points: HA Configurations for HTCondor Services
INFN HTCondor Workshop, Oct 2016
High Availability in HTCondor
Discuss high availability of:
- Central Manager (Collector, Negotiator, CCB)
- Submit node (Schedd)
- Execute node (Startd)
HA of Central Manager
What happens if the Central Manager server fails?
- condor_status fails
- Unclaimed slots stay idle (no new matches are made)
However…
- Jobs keep running!!
- New jobs are launched on claimed slots, until the claim on the slot is broken
- condor_q, condor_submit, condor_rm all continue working (sketch below)
- When the Central Manager is restarted, all state is restored within a few minutes
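What this looks like from the command line while the CM is down (the submit file name and job ID below are just examples):

  condor_status            # fails: cannot contact the collector
  condor_q                 # works: talks to the local schedd
  condor_submit job.sub    # works: the job is queued and can start on already-claimed slots
  condor_rm 123.0          # works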
Auto Failover of Central Manager
Three options:
- Don't worry about it
- Use your data center failover system (VMware, RHEL Cluster Suite, Mesos+Marathon, …). Nothing fancy; disk in SAN, reboot if dead.
- Use HTCondor's condor_had mechanism

HTCondor's CM HAD Approach
- The CM has two services: collector and negotiator
- The collector, including CCB, is active/active: daemons connect to both CMs, and tools randomly pick a live one to use (load balancing). See the pool-side config sketch below.
- The negotiator is active/passive (a pool can have only one active negotiator), controlled by the condor_had daemon. Can be primary/secondary, or peer/peer. See the CM-side config sketch after the diagrams.
- Negotiator state (user usage, priorities) is replicated and re-merged by condor_replication
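A minimal sketch of the pool-member side of the active/active collectors, assuming illustrative hostnames cm1.example.com and cm2.example.com:

  # In condor_config on every machine in the pool (hostnames are examples)
  CONDOR_HOST = cm1.example.com, cm2.example.com
  # COLLECTOR_HOST defaults to $(CONDOR_HOST), so daemons advertise to both
  # collectors, and tools pick a live one to query.
  CCB_ADDRESS = $(COLLECTOR_HOST)    # if using CCB, register with both collectors too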
[Diagrams: two Central Manager machines, each running a condor_master, Collector, Had, and Replication daemon, with execute nodes reporting to both collectors; the Negotiator runs on only one CM at a time. In the second diagram the active CM has failed and condor_had has started the Negotiator on the surviving CM.]
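A minimal configuration sketch for the two central managers themselves, following the condor_had / condor_replication knobs documented in the HTCondor manual; hostnames, ports, and values are illustrative assumptions:

  # condor_config fragment used on both cm1 and cm2 (hostnames are examples)
  CENTRAL_MANAGER1 = cm1.example.com
  CENTRAL_MANAGER2 = cm2.example.com
  CONDOR_HOST = $(CENTRAL_MANAGER1), $(CENTRAL_MANAGER2)

  # Run HAD and REPLICATION next to the collector; the master starts the
  # negotiator only when its condor_had wins the election.
  DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, HAD, REPLICATION
  MASTER_NEGOTIATOR_CONTROLLER = HAD
  HAD_CONTROLLEE = NEGOTIATOR

  HAD_PORT = 51450
  HAD_LIST = $(CENTRAL_MANAGER1):$(HAD_PORT), $(CENTRAL_MANAGER2):$(HAD_PORT)
  HAD_USE_PRIMARY = TRUE              # first host in HAD_LIST is primary; FALSE gives peer/peer

  # Replicate negotiator state (user usage, priorities) between the CMs
  HAD_USE_REPLICATION = TRUE
  REPLICATION_PORT = 41450
  REPLICATION_LIST = $(CENTRAL_MANAGER1):$(REPLICATION_PORT), $(CENTRAL_MANAGER2):$(REPLICATION_PORT)
  STATE_FILE = $(SPOOL)/Accountantnew.log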
HA of Submit Machine
What happens if the server running the Schedd fails?
- condor_q, condor_submit, condor_rm stop working
However…
- Only jobs submitted from that schedd are affected
- Jobs keep running! Execute nodes let a job keep running for the duration of the job lease (40 min by default); even if the job completes before the lease expires, the slot waits idle for the schedd to reconnect and collect the exit status/output
- So if the Schedd restarts within the job lease (40 min by default), everything continues as normal. Useful for reboots and upgrades. (See the lease sketch below.)
- Quiz: Why not make the job lease hours long?
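A minimal sketch of where the job lease is set; the knob names follow the HTCondor manual, while the two-hour value is only an illustration:

  # Per job, in the submit description file:
  job_lease_duration = 7200            # seconds; the default is 2400 (40 min)

  # Or pool-wide, in condor_config:
  JOB_DEFAULT_LEASE_DURATION = 7200

A longer lease tolerates longer schedd outages, but a slot whose job has already finished sits idle until the schedd comes back to collect the output, which is the throughput trade-off behind the quiz.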
Auto Failover of Schedd
Three options:
- Don't worry about it
- Use your data center failover system (VMware, RHEL Cluster Suite, Mesos+Marathon, …). Nothing fancy; disk in SAN, reboot if dead.
- Use the mechanism in condor_master to run only one instance of a daemon (config sketch after the diagram)
Submit-node failover is harder than CM failover because there is a lot more state:
- State generated by HTCondor: job queue, event logs
- State generated by users/jobs: input files, output files
To deal with all the state, we assume a shared file system between the two servers. Would love to try DRBD (distributed replicated block device)…
Failover is peer/peer.
[Diagram: Submit Machine 1 and Submit Machine 2, each running a condor_master and sharing a filesystem; the Schedd runs on only one machine at a time and advertises itself to the Collector. Users on either machine run "condor_submit -name MySchedd", which reaches the schedd wherever it is currently running.]
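A minimal sketch of the condor_master-based failover named in the options above, using the job-queue high-availability knobs from the HTCondor manual; the shared-filesystem path, timing values, and schedd name are example assumptions:

  # condor_config fragment on both submit machines (paths and name are examples)
  MASTER_HA_LIST = SCHEDD                 # the masters arbitrate which one runs the schedd
  SPOOL = /sharedfs/condor/spool          # job queue and spool live on the shared filesystem
  HA_LOCK_URL = file:$(SPOOL)             # lock file that guarantees a single active schedd
  VALID_SPOOL_FILES = $(VALID_SPOOL_FILES), SCHEDD.lock
  HA_LOCK_HOLD_TIME = 300                 # seconds the lock is valid before it must be renewed
  HA_POLL_PERIOD = 60                     # how often the idle master tries to grab the lock
  SCHEDD_NAME = MySchedd@                 # same name on both machines (trailing @ keeps the
                                          # hostname from being appended)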
HA of Execute Node
What happens if a server running a Startd fails?
- Jobs running there will get restarted someplace else
- Data written to local disk by that node is still there… if privacy is a concern, HTCondor can encrypt job I/O on the fly via "encrypt_execute_directory=true"
- The machine's classads are removed from the collector. Optionally they can stick around marked as "absent", visible with "condor_status -absent" (config sketch below)
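A minimal sketch of the two knobs mentioned above, combining the execute-directory encryption setting with the collector's absent-ad settings; the expression and retention value shown are example assumptions:

  # On execute nodes: encrypt job I/O written to the execute directory on the fly
  ENCRYPT_EXECUTE_DIRECTORY = True

  # On the central manager: keep ads of vanished machines around, marked absent
  ABSENT_REQUIREMENTS = True                      # which disappearing ads to keep (example: all)
  ABSENT_EXPIRE_ADS_AFTER = 2592000               # how long to keep absent ads (example: 30 days)
  COLLECTOR_PERSISTENT_AD_LOG = $(SPOOL)/CollectorPersistentAds

  # Then: condor_status -absent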
Take Aways
- Failure of the central manager is not catastrophic unless it is down for quite some time (many minutes, maybe hours).
- Lost throughput due to failure of a submit node can be minimized by restarting the submit node within the job lease, or by splitting jobs across multiple submit nodes.
- HA failover is available in HTCondor for the central manager, CCB, and the schedd, but schedd failover requires a shared file system.
- UW-Madison doesn't bother with any failover; the CMS Global Pool uses CM and CCB failover.
Questions?