1
Talking Points: HA Configurations for HTCondor Services INFN HTCondor Workshop Oct 2016
2
High Availability in HTCondor
Discuss High Availability of:
- Central Manager (Collector, Negotiator, CCB)
- Submit node (Schedd)
- Execute node (Startd)
3
HA of Central Manager
What happens if the Central Manager server fails?
- condor_status fails
- Unclaimed slots stay idle (no new matches made)
- However... jobs keep running!! And new jobs are launched on claimed slots!!! ...until the claim on the slot is broken
- condor_q, condor_submit, condor_rm all continue working
- When the Central Manager is restarted, all state is restored within a few minutes
4
Auto Failover of Central Manager
Three options:
1. Don't worry about it
2. Use your data center failover system (VMware, RHEL Cluster Suite, Mesos+Marathon, ...). Nothing fancy; disk in SAN, reboot if dead.
3. Use HTCondor's condor_had mechanism

HTCondor's CM HAD approach:
- The CM has two services: collector and negotiator
- The collector, including CCB, is active/active: daemons connect to both CMs, and tools randomly pick a live one to use (load balancing)
- The negotiator is active/passive (a pool can only have one active negotiator), controlled by the condor_had daemon. It can be configured primary/secondary or peer/peer.
- Negotiator state (user usage, priorities) is replicated and re-merged by condor_replication
A minimal configuration sketch follows below.
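For concreteness, here is a minimal condor_had configuration sketch, loosely modeled on the example in the HTCondor manual; the host names cm1.example.org / cm2.example.org and the port numbers are placeholders, and the manual remains the authoritative reference for these knobs:

    # Goes in the configuration of BOTH central managers (identical on each).
    # List both CMs, in the same order everywhere in the pool.
    CONDOR_HOST = cm1.example.org, cm2.example.org

    # Run the HAD and REPLICATION daemons alongside the usual CM daemons.
    DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, HAD, REPLICATION

    # The HAD instances talk to each other and elect which CM runs the negotiator.
    HAD_PORT        = 51450
    HAD_LIST        = cm1.example.org:$(HAD_PORT), cm2.example.org:$(HAD_PORT)
    HAD_USE_PRIMARY = TRUE     # first host in HAD_LIST is primary; FALSE gives peer/peer
    MASTER_NEGOTIATOR_CONTROLLER = HAD   # let HAD start/stop the negotiator

    # Replicate negotiator state (user usage, priorities) between the CMs.
    HAD_USE_REPLICATION = TRUE
    REPLICATION_PORT    = 41450
    REPLICATION_LIST    = cm1.example.org:$(REPLICATION_PORT), cm2.example.org:$(REPLICATION_PORT)
    STATE_FILE          = $(SPOOL)/Accountantnew.log

With this in place the collectors are active/active (daemons and tools use either one), while condor_had keeps exactly one negotiator running, as described above.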
5
[Diagram: two central managers, each running a condor_master with Collector, HAD, and Replication daemons; the Negotiator is active on Central Manager 1, and the execute nodes report to both collectors.]
6
[Diagram: the same pool after Central Manager 1 fails; the HAD on Central Manager 2 starts the Negotiator there, and the execute nodes continue reporting to the surviving collector.]
7
HA of Submit Machine
What happens if the server running the Schedd fails?
- condor_q, condor_submit, condor_rm stop working
- However... only jobs submitted from that schedd are impacted
- Jobs keep running! Execute nodes will let a job keep running for the duration of the job_lease (40 min by default); even if the job completes before the lease expires, the slot will wait idle for the schedd to reconnect and collect its exit status/output
- So if the Schedd restarts within the job_lease (40 min default), everything continues as normal. Useful for reboots and upgrades. See the submit-file sketch below.
- Quiz: Why not make job_lease hours long?
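As an illustration, the lease length can be raised per job in the submit description file (job_lease_duration is in seconds; the 2-hour value below is only an example). The flip side, and the point of the quiz above, is that a longer lease means the slots of a dead schedd sit idle longer before their claims are broken:

    # hypothetical submit file, my_job.sub
    executable         = my_job.sh
    # allow the schedd up to 2 hours to reconnect (default is 2400 s = 40 min)
    job_lease_duration = 7200
    queue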
8
Auto Failover of Schedd
Three options:
1. Don't worry about it
2. Use your data center failover system (VMware, RHEL Cluster Suite, Mesos+Marathon, ...). Nothing fancy; disk in SAN, reboot if dead.
3. Use the mechanism in condor_master to run only one instance of a daemon

Submit node failover is harder than CM failover because there is a lot more state:
- State generated by HTCondor: job queue, event logs
- State generated by users/jobs: input files, output files
To deal with all the state, we assume a shared file system between the two servers. Would love to try DRBD (distributed replicated block device)... Failover is peer/peer. A minimal configuration sketch follows below.
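A minimal sketch of the condor_master-based approach, assuming both submit machines mount the same shared filesystem at /shared; the path, poll period, and schedd name are placeholders, and the HTCondor manual's high-availability section gives the full recipe:

    # Goes in the configuration of BOTH submit machines.
    # Job queue, logs, and spooled files live on the shared filesystem.
    SPOOL             = /shared/spool

    # condor_master uses a lock file so only one machine runs the schedd at a time.
    MASTER_HA_LIST    = SCHEDD
    HA_LOCK_URL       = file:/shared/spool
    VALID_SPOOL_FILES = $(VALID_SPOOL_FILES) SCHEDD.lock
    HA_POLL_PERIOD    = 300          # seconds between checks for a dead peer

    # Fixed schedd name, so jobs and tools are not tied to a particular hostname.
    SCHEDD_NAME       = MySchedd@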
9
[Diagram: two submit machines sharing a filesystem; a condor_master runs on each, but only one runs the Schedd ("MySchedd") at a time. Clients use "condor_submit -name MySchedd" and reach whichever machine currently hosts the schedd.]
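For illustration, once the schedd has a fixed name (MySchedd in the diagram), clients keep using the same commands no matter which physical machine currently hosts it; the submit file name and job id below are hypothetical:

    condor_submit -name MySchedd my_job.sub   # submit to the HA schedd
    condor_q      -name MySchedd              # query its queue
    condor_rm     -name MySchedd 1234         # remove a job from it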
10
HA of Execute Node
What happens if the server running a Startd fails?
- Jobs running there will get restarted someplace else
- Data written to local disk by that node is still there... if privacy is a concern, HTCondor can encrypt job I/O on the fly via "encrypt_execute_directory = true"
- Machine classads are removed from the collector
- Optionally they can stick around marked as "absent", visible with "condor_status -absent" (see the sketch below)
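The two knobs mentioned above, shown as a sketch; the first line goes in a job's submit description file, the rest in the central manager's configuration, and the expiration value is a placeholder:

    # submit file: encrypt the job's scratch directory on the execute node
    encrypt_execute_directory = true

    # central manager config: keep classads of vanished machines, marked "absent"
    COLLECTOR_PERSISTENT_AD_LOG = $(SPOOL)/PersistentAdLog
    ABSENT_REQUIREMENTS         = True
    ABSENT_EXPIRE_ADS_AFTER     = 2592000    # drop absent ads after ~30 days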
11
Take Aways
- Failure of the central manager is not catastrophic unless it is down for quite some time (many minutes / maybe hours?).
- Lost throughput due to failure of a submit node can be minimized by restarting the submit node within job_lease minutes, or by splitting up jobs across multiple submit nodes.
- HA failover is available in HTCondor for the central manager, CCB, and schedd --- but schedd failover requires a shared file system.
- UW-Madison doesn't bother with any failover; the CMS Global Pool uses CM and CCB failover.
12
Questions?