High Availability in HTCondor


Talking Points: HA Configurations for HTCondor Services (INFN HTCondor Workshop, Oct 2016)

High Availability in HTCondor
Discuss high availability of:
- the Central Manager (Collector, Negotiator, CCB)
- the Submit node (Schedd)
- the Execute node (Startd)

HA of Central Manager
What happens if the Central Manager server fails?
- condor_status fails
- Unclaimed slots stay idle (no new matches are made)
However...
- Jobs keep running!
- New jobs are launched on claimed slots, until the claim on the slot is broken
- condor_q, condor_submit, and condor_rm all continue working
When the Central Manager is restarted, all state is restored within a few minutes.
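As a rough illustration of the symptoms while the Central Manager is down (commands only; the submit file name and job ID are placeholders):

    condor_status            # fails: the collector on the central manager cannot be reached
    condor_q                 # still works: queries the local schedd directly
    condor_submit job.sub    # still works: the job is queued, but is not matched until the CM is back
    condor_rm 123.0          # still works: removal is handled by the schedd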

Auto Failover of Central Manager
Three options:
- Don't worry about it.
- Use your data center failover system (VMware, RHEL Cluster Suite, Mesos+Marathon, ...). Nothing fancy: disk in a SAN, reboot if dead.
- Use HTCondor's condor_had mechanism.

HTCondor's CM HAD Approach
The CM provides two services: collector and negotiator.
- The Collector, including CCB, is active/active: daemons connect to both CMs, and tools randomly pick a live one to use (load balancing).
- The Negotiator is active/passive (a pool can only have one active negotiator), controlled by the condor_had daemon. It can be configured primary/secondary or peer/peer.
- Negotiator state (user usage, priorities) is replicated and re-merged by condor_replication.
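As a concrete sketch, the HAD setup boils down to a few configuration knobs shared by both central managers. Hostnames and port numbers here are placeholders; check the HTCondor manual for your version for the full recipe (additional knobs such as the HAD state file and connection timeout are omitted).

    # Shared configuration on both central managers (cm1/cm2 are placeholder hostnames)
    CONDOR_HOST    = cm1.example.org, cm2.example.org
    COLLECTOR_HOST = $(CONDOR_HOST)

    # Run the HA daemons alongside the collector and negotiator
    DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, HAD, REPLICATION

    # condor_had decides which machine is allowed to run the negotiator
    HAD_PORT = 51450
    HAD_LIST = cm1.example.org:$(HAD_PORT), cm2.example.org:$(HAD_PORT)
    HAD_USE_PRIMARY = True        # cm1 acts as primary; set False for peer/peer

    # condor_replication keeps negotiator state (usage, priorities) in sync
    REPLICATION_PORT = 41450
    REPLICATION_LIST = cm1.example.org:$(REPLICATION_PORT), cm2.example.org:$(REPLICATION_PORT)
    HAD_USE_REPLICATION = True

    # Let condor_had, rather than condor_master, start and stop the negotiator
    MASTER_NEGOTIATOR_CONTROLLER = HAD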

[Diagram: Central Manager 1 and Central Manager 2 each run a Master, Collector, Replication, and Had daemon; the Negotiator runs on Central Manager 1. Execute nodes report to the Collectors on both machines.]

[Diagram: after Central Manager 1 fails, condor_had starts the Negotiator on Central Manager 2; the execute nodes continue reporting to the surviving Collector.]

HA of Submit Machine
What happens if the server running the Schedd fails?
- condor_q, condor_submit, and condor_rm stop working
However...
- Only jobs submitted from that schedd are affected.
- Jobs keep running! Execute nodes let a job keep running for the duration of the job_lease (40 minutes by default); even if the job completes before the lease expires, the slot waits idle for the schedd to reconnect and collect the exit status and output.
So if the Schedd restarts within the job_lease (40 minutes by default), everything continues as normal. Useful for reboots and upgrades.
Quiz: Why not make the job_lease hours long?
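The lease length is a per-job attribute set at submit time; a minimal submit-file sketch (executable name is a placeholder, the value is illustrative):

    # job.sub: lengthen the reconnect window from the 40-minute (2400 s) default
    executable         = my_program
    job_lease_duration = 7200    # seconds; note a dead schedd also ties up finished slots this long
    queue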

Auto Failover of Schedd
Three options:
- Don't worry about it.
- Use your data center failover system (VMware, RHEL Cluster Suite, Mesos+Marathon, ...). Nothing fancy: disk in a SAN, reboot if dead.
- Use the mechanism in condor_master that runs only one instance of a daemon.
Submit node failover is harder than CM failover because there is a lot more state:
- State generated by HTCondor: job queue, event logs
- State generated by users/jobs: input files, output files
To deal with all this state, we assume a shared filesystem between the two servers. (Would love to try DRBD, a distributed replicated block device...) Failover is peer/peer.
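A minimal sketch of the condor_master-based approach, assuming both submit machines mount the same shared filesystem under /shared (all paths and the schedd name are placeholders; the HTCondor manual's section on high availability of the job queue has the authoritative recipe):

    # Configuration on both submit machines
    MASTER_HA_LIST = SCHEDD                      # master starts the schedd only while it holds the lock
    SPOOL          = /shared/condor/spool        # job queue and spooled files live on the shared filesystem
    HA_LOCK_URL    = file:$(SPOOL)               # lock location used to enforce a single running schedd
    HA_LOCK_HOLD_TIME = 300                      # seconds a lock is held before it must be renewed
    HA_POLL_PERIOD    = 60                       # how often the idle master checks whether to take over

    # Give the schedd a fixed, pool-wide name (trailing @ keeps the hostname from being appended)
    SCHEDD_NAME = MySchedd@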

[Diagram: Submit Machine 1 and Submit Machine 2 share a filesystem holding the schedd's state; the Master on each machine can run the Schedd, but only one runs it at a time. Users address the schedd by its pool-wide name with condor_submit -name MySchedd, regardless of which machine it currently runs on.]
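Because the schedd keeps the same name wherever it runs, users' commands do not change across a failover; for example (MySchedd and job.sub are the placeholders from the diagram):

    condor_submit -name MySchedd job.sub
    condor_q      -name MySchedd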

HA of Execute Node
What happens if a server running a Startd fails?
- Jobs running there get restarted somewhere else.
- Data written to the local disk of that node is still there... if privacy is a concern, HTCondor can encrypt job I/O on the fly via ENCRYPT_EXECUTE_DIRECTORY = True.
- The machine classads are removed from the collector. Optionally they can stick around marked as "absent", visible with condor_status -absent.
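The two knobs mentioned above, sketched as configuration (the execute-directory setting goes on execute nodes, the absent-ad settings on the central manager; the log path and lifetime are illustrative):

    # On execute nodes: encrypt job data written to the local execute directory on the fly
    ENCRYPT_EXECUTE_DIRECTORY = True

    # On the central manager: keep ads of machines that disappear, marked "absent"
    COLLECTOR_PERSISTENT_AD_LOG = /var/lib/condor/spool/CollectorPersistentAdLog
    ABSENT_REQUIREMENTS     = True               # which invalidated machine ads become absent
    ABSENT_EXPIRE_ADS_AFTER = 2592000            # keep absent ads for 30 days (seconds)

Absent machines then show up with condor_status -absent.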

Take Aways
- Failure of the central manager is not catastrophic unless it is down for quite some time (many minutes, maybe hours).
- Lost throughput due to the failure of a submit node can be minimized by restarting the submit node within job_lease minutes, or by splitting jobs across multiple submit nodes.
- HA failover is available in HTCondor for the central manager, CCB, and the schedd, but schedd failover requires a shared filesystem.
- UW-Madison doesn't bother with any failover; the CMS Global Pool uses CM and CCB failover.

Questions?