High Availability in HTCondor


1 Talking Points: HA Configurations for HTCondor Services INFN HTCondor Workshop Oct 2016

2 High Availability in HTCondor
Discuss High Availability of
- Central Manager (Collector, Negotiator, CCB)
- Submit node (Schedd)
- Execute node (Startd)

3 HA of Central Manager What happens if Central Manager server fails?
- condor_status fails
- Unclaimed slots stay idle (no new matches made)
However…
- Jobs keep running!
- New jobs are still launched on claimed slots, until the claim on the slot is broken
- condor_q, condor_submit, condor_rm all continue working
- When the Central Manager is restarted, all state is restored within a few minutes

4 Auto Failover of Central Manager
Three options:
- Don't worry about it
- Use your data center failover system (VMware, RHEL Cluster Suite, Mesos+Marathon, …). Nothing fancy; disk in SAN, reboot if dead.
- Use HTCondor's condor_had mechanism
HTCondor's CM HAD Approach
- The CM has two services: collector and negotiator
- Collector, including CCB, is active/active: daemons connect to both CMs, and tools randomly pick a live one to use (load balance)
- Negotiator is active/passive (only one active negotiator per pool), controlled by the condor_had daemon. Can be primary/secondary, or peer/peer.
- Negotiator state (user usage, priorities) is replicated and re-merged by condor_replication
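The active/active versus active/passive split above can be illustrated with a toy model. This is only a sketch of the behaviour the slide describes, not condor_had's actual election protocol; the function names, the `use_primary` flag, and the host names `cm1` and `cm2` are all hypothetical.

```python
import random


def pick_collector(central_managers, alive, rng=random):
    """Active/active: a tool picks any live collector at random."""
    live = [cm for cm in central_managers if cm in alive]
    if not live:
        # With no live collector, queries such as condor_status fail.
        raise RuntimeError("no live collector")
    return rng.choice(live)


def active_negotiator(central_managers, alive, current=None, use_primary=True):
    """Active/passive: return the one CM that should run the negotiator.

    Toy rule (an assumption, not HTCondor's real algorithm):
    - primary/secondary: the first listed CM leads whenever it is alive.
    - peer/peer: the current leader keeps the role for as long as it is
      alive; otherwise the first live CM takes over.
    """
    live = [cm for cm in central_managers if cm in alive]
    if not live:
        return None  # no matchmaking until a CM comes back
    if use_primary:
        return live[0]
    if current in live:
        return current
    return live[0]
```

For example, with `["cm1", "cm2"]` and only `cm2` alive, both modes hand the negotiator to `cm2`. When `cm1` returns, primary/secondary mode moves the role back to `cm1`, while peer/peer mode leaves it on `cm2`.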

5 [Diagram: Central Manager 1 and Central Manager 2, each running Master, Collector, Replication, and Had, serving the Execute Nodes. The Negotiator runs on Central Manager 1.]

6 [Diagram: Central Manager 1 and Central Manager 2, each running Master, Collector, Replication, and Had, serving the Execute Nodes. The Negotiator now runs on Central Manager 2.]

7 HA of Submit Machine What happens if server running Schedd fails?
- condor_q, condor_submit, condor_rm stop working
However…
- Only jobs submitted from that schedd are affected
- Jobs keep running! Execute nodes let the job keep running for the duration of the job_lease (40 min by default); even if the job completes before the lease expires, the slot waits idle for the schedd to reconnect and collect exit status/output
- So if the Schedd restarts within the job_lease (40 min default), everything continues as normal. Useful for reboots, upgrades.
Quiz: Why not make job_lease hours long?
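One way to reason about the quiz is to put numbers on the lease. The sketch below is a simplified model, assuming the lease was last renewed at the moment the schedd went down and that all times are minutes measured from that moment. The function names are hypothetical; only the 40-minute default comes from the slide.

```python
DEFAULT_JOB_LEASE_MIN = 40  # default quoted on the slide


def schedd_outage_outcome(outage_min, lease_min=DEFAULT_JOB_LEASE_MIN):
    """Classify an outage by whether the schedd returns inside the lease."""
    if outage_min <= lease_min:
        return "reconnect"      # schedd is back in time; work continues
    return "lease_expired"      # execute node gives up on the schedd


def idle_slot_minutes(job_remaining_min, outage_min,
                      lease_min=DEFAULT_JOB_LEASE_MIN):
    """Minutes a slot sits idle after its job finishes during an outage.

    The slot waits until the schedd reconnects or the lease expires,
    whichever comes first (simplifying assumption).
    """
    wait_until = min(outage_min, lease_min)
    return max(0, wait_until - job_remaining_min)


# A job with 10 minutes left, and a schedd that stays down for 1000 minutes:
print(idle_slot_minutes(10, 1000, lease_min=40))    # 30 idle minutes
print(idle_slot_minutes(10, 1000, lease_min=480))   # 470 idle minutes
```

Under these assumptions, a longer lease tolerates longer outages, but when the schedd does not come back, a slot whose job has already finished can be held idle for up to the whole lease.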

8 Auto Failover of Schedd
Three options:
- Don't worry about it
- Use your data center failover system (VMware, RHEL Cluster Suite, Mesos+Marathon, …). Nothing fancy; disk in SAN, reboot if dead.
- Use the mechanism in condor_master to run only one instance of a daemon
Submit node failover is harder than CM failover because there is a lot more state:
- State generated by HTCondor: job queue, event logs
- State generated by user/jobs: input files, output files
To deal with all the state, we assume a shared file system between the two servers
- Would love to try DRBD (Distributed Replicated Block Device)…
Failover is peer/peer

9 condor_submit -name MySchedd
[Diagram: Submit Machine 1 and Submit Machine 2, each with a Master and a Schedd, attached to a Shared Filesystem. A Collector is also shown, and users submit with condor_submit -name MySchedd.]

10 HA of Execute Node What happens if server running a Startd fails?
- Jobs running there will get restarted someplace else
- Data written to local disk by that node is still there… if privacy is a concern, HTCondor can encrypt job I/O on the fly via "encrypt_execute_directory=true"
- Machine classads are removed from the collector
- Optionally they can stick around marked as "absent", visible with "condor_status -absent"

11 Take Aways
- Failure of the central manager is not catastrophic unless it is down for quite some time (many minutes / maybe hours?).
- Lost throughput due to failure of a submit node can be minimized by restarting the submit node within the job_lease (40 minutes by default), or by splitting up jobs across multiple submit nodes.
- HA failover is available in HTCondor for the central manager, CCB, and schedd, but the schedd failover requires a shared file system.
- UW-Madison doesn't bother with any failover; the CMS Global pool uses CM and CCB failover.

12 Questions?

