HTCondor and the Network

1 HTCondor and the Network
Miron Livny John P. Morgridge Professor of Computer Science Wisconsin Institutes for Discovery University of Wisconsin-Madison

2 This talk is NOT about network bandwidth!

3 “ … Since the early days of mankind the primary motivation for the establishment of communities has been the idea that by being part of an organized group the capabilities of an individual are improved. The great progress in the area of inter-computer communication led to the development of means by which stand-alone processing sub-systems can be integrated into multi-computer ‘communities’. … “ Miron Livny, “Study of Load Balancing Algorithms for Decentralized Distributed Processing Systems,” Ph.D. thesis, July 1983.

4 Communities depend on the network to exchange information, to coordinate, to establish trust, …

5 The Open Science Grid Community
“The members of the OSG consortium are united by a commitment to promote the adoption and to advance the state of the art of distributed high throughput computing (dHTC) – shared utilization of autonomous resources where all the elements are optimized for maximizing computational throughput.”

6 OSG by the numbers (09/21/16)

7 This week's numbers … 30M

8

9 “When a workflow might consist of 600,000 jobs, we don’t want to rerun them if we make a mistake. So we use DAGMan (Directed Acyclic Graph Manager, a meta-scheduler for HTCondor) and Pegasus workflow manager to optimize changes,” added Couvares. “The combination of Pegasus, HTCondor, and OSG work great together.” Keeping track of what has run and how the workflow progresses, Pegasus translates the abstract layer of what needs to be done into actual jobs for HTCondor, which then puts them out on OSG.

10 Recent OSG Research Highlights
Update on the Brain Trauma Research Center (8/29/2016)
Where did all the antimatter go? (7/12/2016)
SBGrid uses the OSG to accelerate structural biology research (2/23/2016)
OSG computational power helps solve 30-year-old protein puzzle (1/12/2016)
Harnessing the OSG to better understand the brain (12/2/2015)
Using the OSG to study protein evolution at the University of Pennsylvania (10/26/2015)
Understanding the Ghostly Neutrino (9/18/2015)

11 In 1996 I introduced the distinction between High Performance Computing (HPC) and High Throughput Computing (HTC) in a seminar at the NASA Goddard Space Flight Center, and a month later at the European Laboratory for Particle Physics (CERN). In June of 1997 HPCWire published an interview on High Throughput Computing.

12

13 “Increased advanced computing capability has historically enabled new science, and many fields today rely on high-throughput computing for discovery.”
“Many fields increasingly rely on high-throughput computing”
“Recommendation 2.2. NSF should (a) … and (b) broaden the accessibility and utility of these large-scale platforms by allocating high-throughput as well as high-performance workflows to them.”

14 High Throughput Computing is a 24-7-365 activity and therefore requires automation
FLOPY ≠ (60*60*24*7*52)*FLOPS
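To make the scale concrete, a back-of-the-envelope sketch in Python (the peak rate and utilization figures below are invented for illustration): a 52-week year has about 31.4 million seconds, and the operations actually delivered per year depend on how much of that time the resource is kept busy, which is exactly why automation matters.

    # Back-of-the-envelope: seconds in a (52-week) year, and why FLOPY is not
    # simply FLOPS times that number. Peak rate and utilization are made-up values.
    SECONDS_PER_YEAR = 60 * 60 * 24 * 7 * 52      # 31,449,600 s

    peak_flops = 1e12          # hypothetical 1 TFLOPS machine
    utilization = 0.60         # fraction of the year it actually does useful work

    naive_flopy = peak_flops * SECONDS_PER_YEAR    # assumes true 24-7-365 operation
    delivered_flopy = naive_flopy * utilization    # what you actually get

    print(f"seconds/year    = {SECONDS_PER_YEAR:,}")
    print(f"naive FLOPY     = {naive_flopy:.3e}")
    print(f"delivered FLOPY = {delivered_flopy:.3e}")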

15 HTCondor uses a two-phase matchmaking process: first it allocates a collection of resources to a requestor, and then it selects a task to be delegated for execution within the constraints of these resources.
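A minimal sketch of the two phases (illustrative only, not HTCondor's ClassAd machinery; all attribute names below are invented). Phase 1 corresponds to claiming a resource for the requestor; phase 2 is the requestor choosing which of its jobs to delegate to that claim.

    # Illustrative two-phase matchmaking: phase 1 matches a resource offer to a
    # requestor's requirements (a "claim"), phase 2 lets the requestor pick a job
    # that fits within the claimed resource. Attribute names are invented.

    resources = [
        {"name": "slot1@node17", "cpus": 8, "memory_mb": 16000},
        {"name": "slot1@node42", "cpus": 1, "memory_mb": 2000},
    ]

    request = {"min_cpus": 4, "min_memory_mb": 8000}   # the requestor's requirements

    jobs = [
        {"id": 1, "cpus": 2, "memory_mb": 4000},
        {"id": 2, "cpus": 8, "memory_mb": 12000},
        {"id": 3, "cpus": 16, "memory_mb": 64000},
    ]

    def phase1_claim(resources, request):
        """Matchmaker: allocate (claim) the first resource satisfying the request."""
        for r in resources:
            if r["cpus"] >= request["min_cpus"] and r["memory_mb"] >= request["min_memory_mb"]:
                return r
        return None

    def phase2_delegate(claim, jobs):
        """Requestor: pick a task that fits within the constraints of the claim."""
        for j in jobs:
            if j["cpus"] <= claim["cpus"] and j["memory_mb"] <= claim["memory_mb"]:
                return j
        return None

    claim = phase1_claim(resources, request)
    if claim:
        job = phase2_delegate(claim, jobs)
        print(f"claimed {claim['name']}, delegating job {job['id'] if job else None}")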

16 [Diagram of the matchmaking protocol: the SchedD (S) tells the MatchMaker (MM) "I am S and am looking for resources", the StartD (D) tells it "I am D and I am willing to offer you resources"; MM notifies both of the match, after which S claims the resource from D and delegates work (Wi, W3) to it.]

17 [Architecture diagram spanning a local HTCondor pool and remote resources: User Code/DAGMan, local SchedD, MatchMaker (MM), StartD, glidein Factory and Front End, OSG, Grid CE (SSH, Condor-C), and remote LSF/PBS/HTCondor schedulers running C-apps and G-apps.]

18 Numbers from the UW CHTC pool

19 30-day file numbers

20 TCP numbers

21 And what about Clouds?

22 [Architecture diagram spanning a local HTCondor pool and cloud resources: User Code/DAGMan, local SchedD, MatchMaker (MM), glidein Factory and Front End, OSG, a cloud-side SchedD, EC2 / OpenStack / Spot instances, and VMs running a StartD that executes C-apps remotely.]

23 Jobs running on AWS Spot instances
[Plot of running job count over time: y-axis up to 120K jobs, x-axis from 13:00 to 18:00]

24 Using HTCondor and the OSG GlideIn Factory, ONE submit machine may be managing 100K jobs on 10K remote machines

25 Network challenges in managing HTC workloads on many servers at many sites …

26 We have to deal with worker nodes that have only “outgoing” connectivity – the Condor Connection Broker (CCB)

27 We have to limit the number of ports used by the SchedD – the Shared Port Daemon
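A minimal sketch of the port-sharing idea (not the actual condor_shared_port implementation; the one-line routing header and the handler names are invented): one listening socket accepts every connection, reads which daemon the client wants, and hands the connection to that daemon's handler.

    # Minimal sketch of port demultiplexing: one listening port, many logical services.
    # The "target daemon" header protocol and handler names are invented for illustration.
    import socket
    import threading

    def handle_schedd(conn):
        conn.sendall(b"hello from schedd handler\n")
        conn.close()

    def handle_collector(conn):
        conn.sendall(b"hello from collector handler\n")
        conn.close()

    HANDLERS = {b"schedd": handle_schedd, b"collector": handle_collector}

    def serve(port=9618):
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("", port))
        srv.listen()
        while True:
            conn, _ = srv.accept()
            # First line of the connection names the daemon it is meant for.
            target = conn.makefile("rb").readline().strip()
            handler = HANDLERS.get(target)
            if handler is None:
                conn.close()
                continue
            threading.Thread(target=handler, args=(conn,), daemon=True).start()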

28 We have to run our own directory service – The Collector + ClassAds + HTCondor network address space

29 We have to deal with IPv4 and/or IPv6 end points – Dual stack + Directory services
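A common way to cope with mixed IPv4/IPv6 endpoints, shown here as a generic sketch rather than HTCondor's own address handling, is to resolve with getaddrinfo and try each returned address family in turn:

    # Generic dual-stack client connect: let getaddrinfo return both A and AAAA
    # records and try each candidate until one works. Host/port are placeholders.
    import socket

    def connect_dual_stack(host, port, timeout=10.0):
        last_err = None
        # AF_UNSPEC asks for both IPv4 and IPv6 candidates.
        for family, socktype, proto, _, addr in socket.getaddrinfo(
                host, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
            try:
                s = socket.socket(family, socktype, proto)
                s.settimeout(timeout)
                s.connect(addr)
                return s
            except OSError as err:
                last_err = err
        raise last_err or OSError(f"could not connect to {host}:{port}")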

30 We have to deal with the default (too long) TCP keep-alive and with hypervisors that “keep” connections alive
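Shortening the keepalive on a socket looks roughly like this on Linux (the timing values are illustrative, not HTCondor's defaults):

    # Turn on TCP keepalive and shorten it well below the 2-hour kernel default.
    # The specific timing values are illustrative only; TCP_KEEPIDLE/KEEPINTVL/KEEPCNT
    # are Linux-specific socket options, hence the hasattr guards.
    import socket

    def enable_short_keepalive(sock, idle=300, interval=60, count=5):
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        if hasattr(socket, "TCP_KEEPIDLE"):    # seconds of idle before the first probe
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        if hasattr(socket, "TCP_KEEPINTVL"):   # seconds between probes
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        if hasattr(socket, "TCP_KEEPCNT"):     # failed probes before the connection is dropped
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)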

31 Are you (still) out there???

32
* We implemented a service (CCB) to work around unidirectional connections such as NATs, and to allow an HTCondor service to have a fixed public network address (i.e. the startd's public IP address may itself be changing because it is running in a restarted VM on OpenStack, but because CCB provides a level of indirection, it can present a stable address to schedds).
* We implemented a daemon (shared port) that demultiplexes different streams of data, (a) because of the proliferation of firewalls whose admins do not want to open up a large range of ports, and (b) because we are still limited to the same number of ephemeral TCP ports (64K) we have had since 1981.
* We run our own directory service because DNS isn't always available (or reliable) and doesn't provide enough information anyway.
* With the Lark extensions, we add mechanisms for bandwidth monitoring and flow shaping; specifically, the ability to schedule, monitor, account, limit, isolate, control access to, and prioritize the network traffic of an arbitrary set of black-box processes.
* Mechanisms to limit and determine which interfaces (and/or addresses) we use. (Identity is hard.)
* We use datagrams and do our own fragmentation and reassembly, because some routers simply drop oversize datagrams.
* We had to support IPv6 and mixed-mode operation (identity keeps getting harder).
* We have had to do a lot of work with non-blocking I/O to hide network latency over the WAN.
* We implemented our own external data representation (XDR) equivalent (CEDAR) to deal with serializing data types across different architectures and operating systems. Note: a paper on CEDAR is available at
* We implemented our own security for UDP (using TCP once to do key exchange and caching the result).
* We do our own authentication: different sites have different key distribution policies and mechanisms we need to support.
* We do our own encryption for the same reason.
* We do our own integrity checking in our network layer (CEDAR) because the built-in checksum in TCP and UDP (a 16-bit ones'-complement sum) wasn't good enough to detect errors in large quantities of data. Also, our integrity checksum is cryptographically secure to thwart/detect purposeful tampering (a minimal sketch of the idea follows this list).
* We have implemented transfer queues / tokens / throttles, partly because of disk I/O bandwidth, but also because transferring all flows very slowly is a really bad congestion model for us.
* We adjust the TCP keepalive downward, because 2 hours is /way/ too long. We also have problems with TCP keepalives being kept alive (incorrectly?) by hypervisors long after the VM is gone.
* Frequently IP addresses are all we have for identification, so NATs cause us grief. (Identity.)
* Distinguishing between, and being able to leverage, different private networks is a pain.
* The TCP window opens slowly enough, and closes fast enough, that making good use of available bandwidth in our file transfer mechanism is challenging.
* All file transfers for a job sandbox occur on one unique TCP socket, and we log kernel statistics about this socket to a file for analysis.
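As an illustration of application-level integrity checking that is stronger than the TCP/UDP checksum (a generic sketch, not CEDAR's actual wire format; the key handling and frame layout are invented), each message can carry an HMAC computed over its payload with a shared key:

    # Generic message framing with a cryptographically strong integrity tag.
    # Key handling and the frame layout are invented for illustration; this is
    # not CEDAR's wire format.
    import hmac
    import hashlib
    import os

    KEY = os.urandom(32)   # in practice, a key established during authentication

    def frame(payload: bytes) -> bytes:
        tag = hmac.new(KEY, payload, hashlib.sha256).digest()
        return len(payload).to_bytes(4, "big") + payload + tag

    def unframe(message: bytes) -> bytes:
        length = int.from_bytes(message[:4], "big")
        payload, tag = message[4:4 + length], message[4 + length:]
        if not hmac.compare_digest(tag, hmac.new(KEY, payload, hashlib.sha256).digest()):
            raise ValueError("integrity check failed: message corrupted or tampered with")
        return payload

    assert unframe(frame(b"job classad bytes")) == b"job classad bytes"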

33 Lark: Bringing Network Awareness to High Throughput Computing
Zhe Zhang, Brian Bockelman, Dale W. Carder, Todd Tannenbaum University of Nebraska-Lincoln University of Wisconsin-Madison

34 Goal for this work
Make networking a “first-class” managed resource for HTC systems like HTCondor: the ability to schedule, monitor, account, limit, isolate, and prioritize the network traffic for a set of jobs. [ Note: a “job” is a black-box tree of processes ] The need to manage the network is growing because the size of the data being processed keeps increasing, and because high-throughput clusters have been federated into wide-area international computing grids like the Open Science Grid, which connect hundreds of clusters across dozens of countries. A job is a black box: scientific apps typically do not follow any conventions, protocols, or frameworks…

35 Lark per-job network isolation architecture
Leverage two essential recent additions to the Linux kernel: network namespaces and the virtual Ethernet (veth) device pair. Also leverage Open vSwitch. Normally a process sees the system's network interfaces and routing tables; with network namespaces, a child process can have its own, separate instances of network interfaces and routing tables that operate independently of each other.
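A command-level sketch of these two kernel mechanisms, a namespace plus a veth pair, using the iproute2 tools (the namespace name, device names, and addresses are invented, and this is not Lark's actual implementation; it requires root on a Linux host):

    # Command-level illustration of the kernel mechanisms Lark builds on:
    # a network namespace plus a veth pair bridging it to the host.
    # Namespace, device names, and addresses are made up for the example.
    import subprocess

    def run(cmd):
        print("+", cmd)
        subprocess.run(cmd.split(), check=True)

    def create_job_netns(ns="jobns", host_if="veth-host", job_if="veth-job"):
        run(f"ip netns add {ns}")
        run(f"ip link add {host_if} type veth peer name {job_if}")
        run(f"ip link set {job_if} netns {ns}")
        run(f"ip addr add 10.200.0.1/24 dev {host_if}")
        run(f"ip link set {host_if} up")
        run(f"ip netns exec {ns} ip addr add 10.200.0.2/24 dev {job_if}")
        run(f"ip netns exec {ns} ip link set {job_if} up")
        run(f"ip netns exec {ns} ip link set lo up")

    if __name__ == "__main__":
        create_job_netns()
        # Anything started with `ip netns exec jobns <command>` now sees only
        # veth-job and lo, with its own routing table, i.e. its own network identity.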

36 Resulting node configuration
Each individual job ends up with a unique IP address, and thus each job has its own unique identity on the network. This structure looks familiar… Instead of Open vSwitch, Lark can use an iptables-based NAT chain to connect to the physical device, or nothing at all. Very similar to a traditional LAN configuration, but inside each node.
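A sketch of the NAT alternative (the interface name, subnet, and rules are placeholders, not Lark's actual chain; requires root and iptables on Linux):

    # Sketch of the NAT alternative: masquerade the per-job subnet out of the
    # physical interface instead of attaching it to an Open vSwitch bridge.
    import subprocess

    def run(cmd):
        print("+", cmd)
        subprocess.run(cmd.split(), check=True)

    def nat_job_subnet(subnet="10.200.0.0/24", phys_if="eth0"):
        run("sysctl -w net.ipv4.ip_forward=1")
        run(f"iptables -t nat -A POSTROUTING -s {subnet} -o {phys_if} -j MASQUERADE")
        run(f"iptables -A FORWARD -s {subnet} -o {phys_if} -j ACCEPT")
        run(f"iptables -A FORWARD -d {subnet} -i {phys_if} -m state --state RELATED,ESTABLISHED -j ACCEPT")

    if __name__ == "__main__":
        nat_job_subnet()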


