Talking Points: HA Configurations for HTCondor Services
INFN HTCondor Workshop, Oct 2016
High Availability in HTCondor
Discuss high availability of:
- Central Manager (Collector, Negotiator, CCB)
- Submit node (Schedd)
- Execute node (Startd)
HA of Central Manager
What happens if the Central Manager server fails?
- condor_status fails
- Unclaimed slots stay idle (no new matches are made)
However…
- Jobs keep running!!
- New jobs are launched on claimed slots, until the claim on the slot is broken
- condor_q, condor_submit, condor_rm all continue working (sketch below)
- When the Central Manager is restarted, all state is restored within a few minutes
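What this looks like from the command line while the CM is down (the submit file name and job ID below are just examples):

  condor_status            # fails: cannot contact the collector
  condor_q                 # works: talks to the local schedd
  condor_submit job.sub    # works: the job is queued and can start on already-claimed slots
  condor_rm 123.0          # works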
Auto Failover of Central Manager
Three options:
- Don't worry about it
- Use your data center failover system (VMware, RHEL Cluster Suite, Mesos+Marathon, …). Nothing fancy; disk in SAN, reboot if dead.
- Use HTCondor's condor_had mechanism

HTCondor's CM HAD Approach
- The CM has two services: collector and negotiator
- The collector, including CCB, is active/active: daemons connect to both CMs, and tools randomly pick a live one to use (load balancing). See the pool-side config sketch below.
- The negotiator is active/passive (a pool can have only one active negotiator), controlled by the condor_had daemon. Can be primary/secondary, or peer/peer. See the CM-side config sketch after the diagrams.
- Negotiator state (user usage, priorities) is replicated and re-merged by condor_replication
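A minimal sketch of the pool-member side of the active/active collectors, assuming illustrative hostnames cm1.example.com and cm2.example.com:

  # In condor_config on every machine in the pool (hostnames are examples)
  CONDOR_HOST = cm1.example.com, cm2.example.com
  # COLLECTOR_HOST defaults to $(CONDOR_HOST), so daemons advertise to both
  # collectors, and tools pick a live one to query.
  CCB_ADDRESS = $(COLLECTOR_HOST)    # if using CCB, register with both collectors too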
[Diagrams: two Central Manager machines, each running a condor_master, Collector, Had, and Replication daemon, with execute nodes reporting to both collectors; the Negotiator runs on only one CM at a time. In the second diagram the active CM has failed and condor_had has started the Negotiator on the surviving CM.]
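A minimal configuration sketch for the two central managers themselves, following the condor_had / condor_replication knobs documented in the HTCondor manual; hostnames, ports, and values are illustrative assumptions:

  # condor_config fragment used on both cm1 and cm2 (hostnames are examples)
  CENTRAL_MANAGER1 = cm1.example.com
  CENTRAL_MANAGER2 = cm2.example.com
  CONDOR_HOST = $(CENTRAL_MANAGER1), $(CENTRAL_MANAGER2)

  # Run HAD and REPLICATION next to the collector; the master starts the
  # negotiator only when its condor_had wins the election.
  DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, HAD, REPLICATION
  MASTER_NEGOTIATOR_CONTROLLER = HAD
  HAD_CONTROLLEE = NEGOTIATOR

  HAD_PORT = 51450
  HAD_LIST = $(CENTRAL_MANAGER1):$(HAD_PORT), $(CENTRAL_MANAGER2):$(HAD_PORT)
  HAD_USE_PRIMARY = TRUE              # first host in HAD_LIST is primary; FALSE gives peer/peer

  # Replicate negotiator state (user usage, priorities) between the CMs
  HAD_USE_REPLICATION = TRUE
  REPLICATION_PORT = 41450
  REPLICATION_LIST = $(CENTRAL_MANAGER1):$(REPLICATION_PORT), $(CENTRAL_MANAGER2):$(REPLICATION_PORT)
  STATE_FILE = $(SPOOL)/Accountantnew.log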
HA of Submit Machine
What happens if the server running the Schedd fails?
- condor_q, condor_submit, condor_rm stop working
However…
- Only jobs submitted from that schedd are affected
- Jobs keep running! Execute nodes let a job keep running for the duration of the job lease (40 min by default); even if the job completes before the lease expires, the slot waits idle for the schedd to reconnect and collect the exit status/output
- So if the Schedd restarts within the job lease (40 min by default), everything continues as normal. Useful for reboots and upgrades. (See the lease sketch below.)
- Quiz: Why not make the job lease hours long?
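A minimal sketch of where the job lease is set; the knob names follow the HTCondor manual, while the two-hour value is only an illustration:

  # Per job, in the submit description file:
  job_lease_duration = 7200            # seconds; the default is 2400 (40 min)

  # Or pool-wide, in condor_config:
  JOB_DEFAULT_LEASE_DURATION = 7200

A longer lease tolerates longer schedd outages, but a slot whose job has already finished sits idle until the schedd comes back to collect the output, which is the throughput trade-off behind the quiz.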
Auto Failover of Schedd
Three options:
- Don't worry about it
- Use your data center failover system (VMware, RHEL Cluster Suite, Mesos+Marathon, …). Nothing fancy; disk in SAN, reboot if dead.
- Use the mechanism in condor_master to run only one instance of a daemon (config sketch after the diagram)
Submit-node failover is harder than CM failover because there is a lot more state:
- State generated by HTCondor: job queue, event logs
- State generated by users/jobs: input files, output files
To deal with all the state, we assume a shared file system between the two servers. Would love to try DRBD (distributed replicated block device)…
Failover is peer/peer.
[Diagram: Submit Machine 1 and Submit Machine 2, each running a condor_master and sharing a filesystem; the Schedd runs on only one machine at a time and advertises itself to the Collector. Users on either machine run "condor_submit -name MySchedd", which reaches the schedd wherever it is currently running.]
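A minimal sketch of the condor_master-based failover named in the options above, using the job-queue high-availability knobs from the HTCondor manual; the shared-filesystem path, timing values, and schedd name are example assumptions:

  # condor_config fragment on both submit machines (paths and name are examples)
  MASTER_HA_LIST = SCHEDD                 # the masters arbitrate which one runs the schedd
  SPOOL = /sharedfs/condor/spool          # job queue and spool live on the shared filesystem
  HA_LOCK_URL = file:$(SPOOL)             # lock file that guarantees a single active schedd
  VALID_SPOOL_FILES = $(VALID_SPOOL_FILES), SCHEDD.lock
  HA_LOCK_HOLD_TIME = 300                 # seconds the lock is valid before it must be renewed
  HA_POLL_PERIOD = 60                     # how often the idle master tries to grab the lock
  SCHEDD_NAME = MySchedd@                 # same name on both machines (trailing @ keeps the
                                          # hostname from being appended)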
HA of Execute Node
What happens if a server running a Startd fails?
- Jobs running there will get restarted someplace else
- Data written to local disk by that node is still there… if privacy is a concern, HTCondor can encrypt job I/O on the fly via "encrypt_execute_directory=true"
- The machine's classads are removed from the collector. Optionally they can stick around marked as "absent", visible with "condor_status -absent" (config sketch below)
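A minimal sketch of the two knobs mentioned above, combining the execute-directory encryption setting with the collector's absent-ad settings; the expression and retention value shown are example assumptions:

  # On execute nodes: encrypt job I/O written to the execute directory on the fly
  ENCRYPT_EXECUTE_DIRECTORY = True

  # On the central manager: keep ads of vanished machines around, marked absent
  ABSENT_REQUIREMENTS = True                      # which disappearing ads to keep (example: all)
  ABSENT_EXPIRE_ADS_AFTER = 2592000               # how long to keep absent ads (example: 30 days)
  COLLECTOR_PERSISTENT_AD_LOG = $(SPOOL)/CollectorPersistentAds

  # Then: condor_status -absent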
Take Aways
- Failure of the central manager is not catastrophic unless it is down for quite some time (many minutes, maybe hours).
- Lost throughput due to failure of a submit node can be minimized by restarting the submit node within the job lease, or by splitting jobs across multiple submit nodes.
- HA failover is available in HTCondor for the central manager, CCB, and the schedd, but schedd failover requires a shared file system.
- UW-Madison doesn't bother with any failover; the CMS Global Pool uses CM and CCB failover.
Questions?