HTCondor and the Network


HTCondor and the Network Miron Livny John P. Morgridge Professor of Computer Science Wisconsin Institutes for Discovery University of Wisconsin-Madison

This talk is NOT about network bandwidth!

“… Since the early days of mankind the primary motivation for the establishment of communities has been the idea that by being part of an organized group the capabilities of an individual are improved. The great progress in the area of inter-computer communication led to the development of means by which stand-alone processing sub-systems can be integrated into multi-computer ‘communities’. …” Miron Livny, “Study of Load Balancing Algorithms for Decentralized Distributed Processing Systems,” Ph.D. thesis, July 1983.

Communities depend on the network to exchange information, to coordinate, to establish trust, …

The Open Science Grid Community “The members of the OSG consortium are united by a commitment to promote the adoption and to advance the state of the art of distributed high throughput computing (dHTC) – shared utilization of autonomous resources where all the elements are optimized for maximizing computational throughput.”

OSG by the numbers (09/21/16)

This week's numbers … 30M

“When a workflow might consist of 600,000 jobs, we don’t want to rerun them if we make a mistake. So we use DAGMan (Directed Acyclic Graph Manager, a meta-scheduler for HTCondor) and Pegasus workflow manager to optimize changes,” added Couvares. “The combination of Pegasus, HTCondor, and OSG work great together.” Keeping track of what has run and how the workflow progresses, Pegasus translates the abstract layer of what needs to be done into actual jobs for HTCondor, which then puts them out on OSG.

Recent OSG Research Highlights
Update on the Brain Trauma Research Center (8/29/2016)
Where did all the antimatter go? (7/12/2016)
SBGrid uses the OSG to accelerate structural biology research (2/23/2016)
OSG computational power helps solve 30-year-old protein puzzle (1/12/2016)
Harnessing the OSG to better understand the brain (12/2/2015)
Using the OSG to study protein evolution at the University of Pennsylvania (10/26/2015)
Understanding the Ghostly Neutrino (9/18/2015)

In 1996 I introduced the distinction between High Performance Computing (HPC) and High Throughput Computing (HTC) in a seminar at the NASA Goddard Space Flight Center and, a month later, at the European Laboratory for Particle Physics (CERN). In June of 1997 HPCWire published an interview on High Throughput Computing.

“Increased advanced computing capability has historically enabled new science, and many fields today rely on high-throughput computing for discovery.” “Many fields increasingly rely on high-throughput computing” “Recommendation 2.2. NSF should (a) … and (b) broaden the accessibility and utility of these large-scale platforms by allocating high-throughput as well as high-performance workflows to them.”

High Throughput Computing is a 24-7-365 activity and therefore requires automation. FLOPY ≠ (60*60*24*7*52)*FLOPS
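For reference, the constant in that inequality is just the number of seconds in a 52-week year; the snippet below is plain arithmetic, not HTCondor code.

```python
# Seconds in a 52-week year: the factor that separates FLOPS from FLOPY.
seconds_per_year = 60 * 60 * 24 * 7 * 52
print(seconds_per_year)   # 31449600
# Sustained yearly throughput only approaches peak_flops * seconds_per_year
# if the system keeps working around the clock, which is why HTC needs automation.
```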

HTCondor uses a two-phase matchmaking process: first it allocates a collection of resources to a requestor, and then it selects a task to be delegated for execution within the constraints of those resources.

[Diagram: the SchedD (S, “I am S and am looking for resources”) and the StartD (D, “I am D and I am willing to offer you resources”) both advertise to the MatchMaker (MM); after a Match! is announced to each side, the SchedD claims the resource and delegates work items (Wi, W3) to the StartD.]
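A toy sketch of the two-phase idea in Python; the machine and job attributes and the matching rule below are simplified stand-ins, not HTCondor's ClassAd language or its actual negotiation protocol.

```python
# Phase 1: the matchmaker pairs a resource request with a machine offer.
# Phase 2: the schedd, holding the resulting claim, picks which job to run on it.

machines = [
    {"Name": "slot1@node01", "Memory": 4096, "HasGPU": False},
    {"Name": "slot1@node02", "Memory": 16384, "HasGPU": True},
]

jobs = [
    {"Id": 1, "RequestMemory": 2048, "NeedGPU": False},
    {"Id": 2, "RequestMemory": 8192, "NeedGPU": True},
]

def matches(job, machine):
    return (machine["Memory"] >= job["RequestMemory"]
            and (not job["NeedGPU"] or machine["HasGPU"]))

# Phase 1: allocate a machine to the requestor (the schedd) based on one request.
request = jobs[0]
claimed = next(m for m in machines if matches(request, m))

# Phase 2: within the claim, the schedd is free to pick any job that still fits,
# and can reuse the claim for more work without going back to the matchmaker.
runnable = [j for j in jobs if matches(j, claimed)]
print(f"Claimed {claimed['Name']}; can run jobs {[j['Id'] for j in runnable]}")
```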

[Diagram: a local HTCondor pool (User Code/DAGMan, SchedD, MatchMaker, StartDs running C-apps) reaching remote resources through an OSG glidein Factory and Frontend, Grid CEs, SSH, and Condor-C, where the remote batch systems (HTCondor, LSF, PBS) run glideins (C-apps) alongside grid applications (G-apps).]

Numbers from the UW CHTC pool

File numbers over 30 days

TCP numbers

And what about Clouds?

[Diagram: the same local HTCondor pool (User Code/DAGMan, SchedD, MatchMaker) reaching cloud resources, where the OSG glidein Factory and Frontend provision VMs on EC2, OpenStack, and Spot instances; each VM runs a StartD and C-apps for the remote workload.]

[Plot: jobs running on AWS Spot instances over time (y-axis up to 120K, x-axis 13:00 to 18:00).]

Using HTCondor and the OSG GlideIn Factory, ONE submit machine may be managing 100K jobs on 10K remote machines

Network challenges in managing HTC workloads on many servers at many sites …

We have to deal with worker nodes that have only “outgoing” connectivity – the Condor Connection Broker (CCB)
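To make the brokering idea concrete, here is a toy sketch of "connection brokering" with plain Python sockets: the worker keeps one outbound connection to a broker, and when a client wants to reach it, the broker relays the request and the worker dials back out. The broker address, ports, and message format are invented for illustration and are not HTCondor's actual CCB protocol.

```python
import socket
import threading
import time

BROKER = ("127.0.0.1", 9618)   # hypothetical broker address
CLIENT = ("127.0.0.1", 9700)   # the submit side can accept incoming connections

def broker():
    srv = socket.create_server(BROKER)
    worker_link, request = None, None
    for _ in range(2):
        conn, _ = srv.accept()
        first = conn.recv(1024)
        if first.startswith(b"REGISTER"):   # the worker's long-lived outbound link
            worker_link = conn
        else:                               # a client asking for a connect-back
            request = first
            conn.close()
    worker_link.sendall(request)            # relay over the existing outbound link

def worker():
    link = socket.create_connection(BROKER)      # outgoing connectivity only
    link.sendall(b"REGISTER")
    host, port = link.recv(1024).decode().split()
    back = socket.create_connection((host, int(port)))
    back.sendall(b"hello from behind the NAT")

def client():
    listener = socket.create_server(CLIENT)
    ask = socket.create_connection(BROKER)
    ask.sendall(f"{CLIENT[0]} {CLIENT[1]}".encode())
    conn, _ = listener.accept()                  # the worker connected out *to us*
    print(conn.recv(1024).decode())

b = threading.Thread(target=broker)
b.start()
time.sleep(0.2)                                  # give the broker time to start listening
for t in (threading.Thread(target=worker), threading.Thread(target=client)):
    t.start()
b.join()
```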

We have to limit the number of ports used by the SchedD - The Shared Port Daemon
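A toy sketch of the demultiplexing idea: one listening socket for the whole node, with the first bytes of each connection naming the intended daemon. The header format and port number are invented; the real condor_shared_port daemon has its own handshake and passes each connection on to the target daemon.

```python
import socket
import threading

def schedd_handler(conn):
    conn.sendall(b"schedd here\n"); conn.close()

def startd_handler(conn):
    conn.sendall(b"startd here\n"); conn.close()

HANDLERS = {b"schedd": schedd_handler, b"startd": startd_handler}
ADDRESS = ("127.0.0.1", 9620)        # hypothetical single shared port for the node

def shared_port():
    srv = socket.create_server(ADDRESS)
    ready.set()
    while True:
        conn, _ = srv.accept()
        target = conn.recv(32).strip()           # first bytes name the target daemon
        handler = HANDLERS.get(target, lambda c: c.close())
        threading.Thread(target=handler, args=(conn,)).start()

ready = threading.Event()
threading.Thread(target=shared_port, daemon=True).start()
ready.wait()

# Two different "daemons", one externally visible port.
for name in (b"schedd", b"startd"):
    with socket.create_connection(ADDRESS) as c:
        c.sendall(name + b"\n")
        print(c.recv(64).decode().strip())
```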

We have to run our own directory service – The Collector + ClassAds + HTCondor network address space
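For comparison with DNS, this is roughly how one can ask the Collector directly for daemon locations using the htcondor Python bindings, assuming they are installed; `cm.example.org` and the constraint are placeholders for your own pool.

```python
# Query HTCondor's own "directory service" (the Collector) for startd ads.
import htcondor

collector = htcondor.Collector("cm.example.org")   # placeholder central manager
ads = collector.query(
    htcondor.AdTypes.Startd,
    constraint='PartitionableSlot =?= True',       # example constraint
    projection=["Name", "MyAddress", "Memory", "Cpus"],
)
for ad in ads:
    # MyAddress carries HTCondor's own address string (the "sinful string"),
    # which can include CCB and shared-port routing information.
    print(ad["Name"], ad["MyAddress"])
```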

We have to deal with IPv4 and/or IPv6 end points – Dual stack + Directory services
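A minimal sketch of what mixed IPv4/IPv6 endpoints look like at the socket level (generic Python, not HTCondor code; the host name is a placeholder): resolve both address families and try each result in turn.

```python
import socket

def connect_dual_stack(host, port):
    """Try every address getaddrinfo returns, IPv6 or IPv4, until one connects."""
    last_error = None
    for family, socktype, proto, _, addr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        try:
            s = socket.socket(family, socktype, proto)
            s.connect(addr)
            return s                  # first family that works wins
        except OSError as err:
            last_error = err
    raise last_error

# Example (placeholder host):
# sock = connect_dual_stack("cm.example.org", 9618)
```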

We have to deal with the default (too long) TCP keep-alive and with hypervisors that “keep” connections alive
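A sketch of shortening TCP keepalive on a socket; the TCP_KEEP* options are Linux-specific, and the values here are illustrative rather than HTCondor's defaults.

```python
import socket

def tune_keepalive(sock, idle=300, interval=60, count=5):
    """Enable keepalive and probe well before the kernel's ~2 hour default."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)       # idle seconds before probing
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)  # seconds between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)       # probes before declaring the peer dead

# Example (placeholder host):
# s = socket.create_connection(("cm.example.org", 9618))
# tune_keepalive(s)
```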

Are you (still) out there???

* We implemented a service (CCB) to work around unidirectional connectivity such as NATs, and to allow an HTCondor service to have a fixed public network address (i.e. the startd's public IP address itself may be changing because it is running in a restarted VM on OpenStack, but because CCB provides a level of indirection, it can provide a stable address to schedds).
* We implemented a daemon (shared port) that demultiplexes different streams of data, (a) because of the proliferation of firewalls whose admins do not want to open up a large range of ports, and (b) because we're still limited to the same number of ephemeral TCP ports (64k) we've had since 1981.
* We run our own directory service because DNS isn't always available (or reliable) and doesn't provide enough information anyway.
* With the Lark extensions, we have mechanisms for bandwidth monitoring and flow shaping; specifically, the ability to schedule, monitor, account, limit, isolate, control access to, and prioritize the network traffic for an arbitrary set of black-box processes.
* We have mechanisms to limit and determine which interfaces (and/or addresses) we use. (Identity is hard.)
* We use datagrams and do our own fragmentation and reassembly, because some routers just drop oversize datagrams.
* We had to support IPv6 and mixed-mode operation (identity keeps getting harder).
* We've had to do a lot of work with non-blocking I/O to hide network latency over the WAN.
* We implemented our own equivalent of external data representation (XDR), called CEDAR, to deal with serializing data types across different architectures and operating systems. Note: a paper on CEDAR is available at https://is.gd/3sn20P
* We implemented our own security for UDP (using TCP once to do key exchange and caching the result).
* We do our own authentication: different sites have different key distribution policies and mechanisms we need to support.
* We do our own encryption for the same reason.
* We do our own integrity checking in our network layer (CEDAR) because the built-in checksum in TCP and UDP (based on 16-bit ones'-complement sums) wasn't good enough to detect errors in large quantities of data. Also, our integrity checksum is cryptographically secure to thwart/detect purposeful tampering. (See the sketch after this list.)
* We've implemented transfer queues / tokens / throttles, partly because of disk I/O bandwidth, but also because transferring all flows very slowly is a really bad congestion model for us.
* We adjust the TCP keepalive downward, because 2 hours is /way/ too long. We also have problems with TCP keepalives being kept alive (incorrectly?) by hypervisors long after the VM is gone.
* Frequently IP addresses are all we have for identification, so NATs cause us grief. (Identity.)
* Distinguishing between and being able to leverage different private networks is a pain.
* The TCP window opens slowly enough, and closes fast enough, that making good use of available bandwidth in our file transfer mechanism is challenging.
* All of a job sandbox's file transfers occur on one unique TCP socket, and we log kernel statistics about this socket to a file for analysis.
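To make the checksum point concrete, the sketch below (referenced in the integrity-checking bullet above) contrasts the standard 16-bit ones'-complement Internet checksum with a cryptographic digest; it is a generic illustration, not CEDAR's code.

```python
# A 16-bit ones'-complement checksum cannot even see two 16-bit words being
# swapped, while a cryptographic digest (here SHA-256) detects any change and
# also resists deliberate tampering.
import hashlib

def internet_checksum(data: bytes) -> int:
    """Standard Internet checksum (RFC 1071 style) over 16-bit words."""
    if len(data) % 2:
        data += b"\x00"
    total = sum((data[i] << 8) | data[i + 1] for i in range(0, len(data), 2))
    while total >> 16:                       # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

original = b"ABCDEFGH"
swapped = b"CDABEFGH"                        # first two 16-bit words exchanged

print(hex(internet_checksum(original)), hex(internet_checksum(swapped)))             # identical
print(hashlib.sha256(original).hexdigest() == hashlib.sha256(swapped).hexdigest())   # False
```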

Lark: Bringing Network Awareness to High Throughput Computing Zhe Zhang, Brian Bockelman, Dale W. Carder, Todd Tannenbaum University of Nebraska-Lincoln University of Wisconsin-Madison

Goal for this work: make networking a “first-class” managed resource for HTC systems like HTCondor, with the ability to schedule, monitor, account, limit, isolate, and prioritize the network traffic for a set of jobs. [Note: a “job” is a black-box tree of processes.] The need to manage the network is increasing because the size of the data being processed keeps growing, and because high-throughput clusters have become federated into wide-area international computing grids like the Open Science Grid, which connect hundreds of clusters across dozens of countries. A job is a black box; scientific apps typically do not follow any conventions, protocols, or frameworks…

Lark per-job network isolation architecture: leverage two essential recent additions to the Linux kernel, network namespaces and virtual Ethernet device (veth) pairs, and also leverage Open vSwitch. Normally a process sees the system's network interfaces and routing tables. With network namespaces, a child process can have different and separate instances of network interfaces and routing tables that operate independently of each other.
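A sketch of those kernel building blocks using standard iproute2 commands driven from Python: a namespace plus a veth pair, one end left on the host and one end moved into the namespace. The namespace, interface names, and addresses are placeholders, root privileges are required, and this is not Lark's actual implementation.

```python
import subprocess

def run(cmd):
    print("+", cmd)
    subprocess.run(cmd.split(), check=True)

run("ip netns add job_ns")                                    # per-job namespace
run("ip link add veth_host type veth peer name veth_job")     # virtual Ethernet pair
run("ip link set veth_job netns job_ns")                      # one end goes to the job
run("ip addr add 192.168.50.1/24 dev veth_host")
run("ip link set veth_host up")
run("ip netns exec job_ns ip addr add 192.168.50.2/24 dev veth_job")
run("ip netns exec job_ns ip link set veth_job up")
run("ip netns exec job_ns ip link set lo up")

# A process started with "ip netns exec job_ns <cmd>" now sees only veth_job,
# so its traffic can be monitored, accounted, and shaped independently.
```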

Resulting node configuration: each individual job ends up with a unique IP address, and thus each job has its own unique identity on the network. This structure looks familiar… Instead of Open vSwitch, Lark can use an iptables-based NAT chain to connect to the physical device, or nothing at all. The result is very similar to a traditional LAN configuration, but inside each node.
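And a sketch of the iptables-based NAT alternative mentioned above, again with placeholder interface and subnet names and no claim to match Lark's actual rules; it masquerades the per-job subnet out through the node's physical interface instead of bridging with Open vSwitch.

```python
import subprocess

def run(cmd):
    subprocess.run(cmd.split(), check=True)

run("sysctl -w net.ipv4.ip_forward=1")                        # let the node route for its jobs
run("iptables -t nat -A POSTROUTING -s 192.168.50.0/24 -o eth0 -j MASQUERADE")
run("iptables -A FORWARD -i veth_host -o eth0 -j ACCEPT")
run("iptables -A FORWARD -i eth0 -o veth_host -m state --state ESTABLISHED,RELATED -j ACCEPT")
```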
