1
HTCondor / HEP Partnership and Activities
HEPiX Fall 2014
Todd Tannenbaum
Center for High Throughput Computing, Department of Computer Sciences, University of Wisconsin-Madison
2
University of Wisconsin Center for High Throughput Computing
3
HTCondor
› Open source distributed high throughput computing
› Management of resources, jobs, and workflows
› Primary objective: assist the scientific community with their high throughput computing needs
› Mature technology…
4
[Timeline figure, 1978–1993: Enslow's DPS paper, Miron Livny's PhD, Condor deployed]
5
Mature… but actively developed
› Last year: 96 new enhancements, 2100+ commits, 8 releases, 39k regression test runs
› Open source development model
› Evolve to meet the needs of the science community in an ever-changing computing landscape
6
BNL (US-ATLAS T1) and FNAL (US-CMS T1) adopted Condor in the early days of PPDG (~2000). Rutherford Appleton Laboratory (UK T1) adopted HTCondor in 2013.
7
CERN (T0) HTCondor activity in 2014
8
Why am I here?
We want to work together with the HEP community to leverage our collective experience, effort, and know-how to offer an open source solution that meets the growing needs of HEP high throughput computing in a challenging budget environment.
Please talk to one of these handsome fellas!
9
Current Channels
› Documentation
› Community support email list (htcondor-users)
› Ticket-tracked developer support
› Fully open development model
› Commercial options for 24/7 support
› Bi-weekly/monthly phone conferences with CMS, LIGO, IceCube, LSST, FNAL, iPlant, …: identify and track current problems, communicate and plan future goals, identify and collaborate on challenges (f2f)
10
HTCondor Week
› Each year in early May in Madison, WI
› Perhaps other locales/focuses?
11
HTCondor R&D heavily influenced by HEP requirements
› Now that we see more HTCondor pools in HEP/LHC, "inter-HTCondor" functionality is more important:
  Distributed scheduling policy: no reliance on shared UIDs or file servers; file-movement scheduling, with overlap of file stage-out and computation
  Networking: can traverse firewalls, NATs, and private nets via CCB ("Condor Connection Broker")
  Federation technologies: flocking, glidein via grid universe jobs, HTCondor CE
13
[Figure: a personal Condor on your workstation, with 600 Condor jobs, using its local Condor pool and a friendly Condor pool]
14
How Flocking Works
› Add a line to your condor_config:
  FLOCK_HOSTS = Pool-Foo, Pool-Bar
[Figure: the submit machine's schedd talks to the collector and negotiator on its own central manager (CONDOR_HOST) and to the central managers of Pool-Foo and Pool-Bar]
15
HTCondor Flocking
› Used by many sites in OSG to easily federate all institutional clusters; fast and easy to set up (see the configuration sketch below)
› Remote pools are contacted in the order specified until jobs are satisfied
› The user-priority system is "flocking-aware": a pool's local users can have priority over remote users "flocking" in.
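As a concrete illustration, here is a minimal flocking sketch using the FLOCK_TO / FLOCK_FROM knobs documented in the HTCondor manual (the slide above uses the shorthand FLOCK_HOSTS); the host names are made-up placeholders:

  # condor_config on the submit machine: when local resources are exhausted,
  # try these pools, in the order listed
  FLOCK_TO = cm.pool-foo.example, cm.pool-bar.example

  # condor_config on Pool-Foo's central manager: accept flocked jobs
  # from that submit machine
  FLOCK_FROM = submit.chtc.example
  ALLOW_WRITE = $(ALLOW_WRITE), $(FLOCK_FROM)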
16
Glidein via Grid Universe
› Reliable, durable submission of a job to a remote scheduler to build a batch-scheduler overlay network (submit-file sketch below)
› Supports many "back end" types: Globus (GT2, GT5), NorduGrid, UNICORE, HTCondor, PBS, LSF, SGE, EC2, OpenStack, Deltacloud, CREAM, SSH, BOINC
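A hedged sketch of such a grid-universe submit file, sending a pilot/glidein startup job to a remote HTCondor schedd (the "HTCondor" back end above); the host names and the glidein_startup.sh script are placeholders, not taken from the talk:

  universe = grid
  grid_resource = condor remote-schedd.example.org remote-cm.example.org
  executable = glidein_startup.sh
  output = glidein.out
  error = glidein.err
  log = glidein.log
  queue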
17
Another current collaboration: the CMS "global pool" project
› Dynamically created via glideinWMS from grid, cloud, and HPC resources
› 200,000 cores, 400,000 jobs queued
› 10 submit nodes (5 for production, 5 for analysis), using DAGMan to manage workflows
› Execute nodes will run behind firewalls and NATs, necessitating CCB use (sketch below); we will also want to minimize the number and frequency of TCP connections
› We expect at least CERN to provide worker nodes with outbound IPv6 but without outbound IPv4 networking
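A minimal sketch of the execute-node configuration that enables CCB, assuming the pool collector doubles as the broker (the private-network name is a made-up placeholder):

  # condor_config on a worker node behind a firewall/NAT: register with the
  # pool collector's CCB so the schedd can reach us over our outbound connection
  CCB_ADDRESS = $(COLLECTOR_HOST)
  # Hosts sharing this private network can still connect to us directly
  PRIVATE_NETWORK_NAME = cms-site.example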
18
Scalability Work
› Test, measure, enhance, repeat!
› Enhanced protocols to lower latency and connection counts, more non-blocking I/O, statistics
› Currently at ~155,000 jobs running reliably across 8 submit machines
19
IPv6 Work: Mixed-mode Flocking
› Two pools: IPv4-only and IPv6-only
› A dual-stack (IPv4 & IPv6) HTCondor submit node (schedd) can participate in both pools (configuration sketch below)
› The schedd appears single-protocol to each pool: it rewrites advertised addresses to match the protocol of the active stream
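A sketch of what the dual-stack submit node's configuration might look like; the ENABLE_IPV4/ENABLE_IPV6 knobs come from later HTCondor releases and the pool names are placeholders, so treat this as an assumption rather than the exact setup behind this work:

  # condor_config on the dual-stack schedd machine
  ENABLE_IPV4 = TRUE
  ENABLE_IPV6 = TRUE
  # Flock to one IPv4-only pool and one IPv6-only pool; advertised addresses
  # are rewritten to match the protocol of the active stream
  FLOCK_TO = cm-v4.example.org, cm-v6.example.org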
20
Automated IPv6 testing
› Today: our test suite exclusively uses IPv4
› Goals: existing tests run twice, once using IPv4 and once using IPv6 (excepting things we don't plan on supporting, e.g. the standard universe); new tests for mixed-mode
21
HTCondor Linux containers support
Power to the admin! Tame those jobs!
22
Containers in HTCondor
› HTCondor can currently leverage Linux containers / cgroups to run jobs (configuration sketch below):
  Limiting/monitoring CPU core usage
  Limiting/monitoring physical RAM usage
  Tracking all subprocesses
  Private file namespace (each job can have its own /tmp!)
  Private PID namespace
  Chroot jail
  Private network namespace (soon! each job can have its own network address)
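A hedged condor_config sketch for an execute node, using knobs documented in the HTCondor manual, that turns on several of the features listed above:

  # Place each job in its own cgroup under this base, so CPU and memory use
  # of the whole process tree can be tracked and limited
  BASE_CGROUP = htcondor
  CGROUP_MEMORY_LIMIT_POLICY = hard
  # Give each job a private PID namespace and private /tmp and /var/tmp
  USE_PID_NAMESPACES = true
  MOUNT_UNDER_SCRATCH = /tmp, /var/tmp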
23
More containers… Docker and HTCondor
24
This is Docker
Docker manages Linux containers. Containers give Linux processes a private:
  Root file system
  Process space
  NATed network
25
Examples
This is an "ubuntu" container; this is my host OS, running Fedora. Processes in other containers on this machine can NOT see what's going on in this "ubuntu" container.
26
Command line example
  $ docker run ubuntu cat /etc/debian_version
All docker commands are bound into the "docker" executable. The "run" command runs a process in a container. "ubuntu" is the base filesystem for the container, an "image". "cat" is the Unix process we will run from the image (followed by its arguments).
27
At the Command Line
  $ hostname
  whale
  $ cat /etc/redhat-release
  Fedora release 20 (Heisenbug)
  $ docker run ubuntu cat /etc/debian_version
  jessie/sid
  $ time docker run ubuntu sleep 0
  real    0m1.825s
  user    0m0.017s
  sys     0m0.024s
28
Images
Images provide the user-level filesystem: everything but the Linux kernel. You can make your own; the Docker hub provides many standard ones, and docker can pull images from the hub.
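For instance, pulling a standard image from the hub and listing what is cached locally (stock docker commands, shown here for illustration):

  $ docker pull ubuntu    # fetch the standard "ubuntu" image from the Docker hub
  $ docker images         # list the images now available on this host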
29
Images are copy-on-write
All changes are written to the top-level layer. Changes can be pulled out of the container after it exits.
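One hedged way to do that with stock docker commands (the container name, file path, and command are made up for illustration):

  $ docker run --name myjob ubuntu touch /results.txt   # job writes into the image's top layer
  $ docker diff myjob                                   # list files the job added or changed
  $ docker cp myjob:/results.txt ./results.txt          # copy the output out after the container exits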
30
Why should you care?
› Reproducibility: how many .so's in /usr/lib64 do you use? Will a RHEL 6 app run on RHEL 9 in five years?
› Packaging: an image is a great way to package large software stacks
› Imagine an OSG with container support!
31
I Know What You Are Thinking!
32
Isn't this a Virtual Machine?
› Containers share the Linux kernel with the host
› The host can "ps" into a container: a one-way mirror, not a black box
› Docker provides a namespace for images
› Docker containers do not run system daemons: CUPS, email, cron, init, fsck, … (think about security!)
› Docker images are much smaller than VM images: just a set of files, not a disk image
› Much more likely to be universally available
33
Semantics: VM vs. Container
› VMs provide ONE operation: boot the black box, run until poweroff
› Containers provide a process-like interface: start this process within the container, run until that process exits
› Much more HTCondor-like
34
Docker and HTCondor
› Package HTCondor as a docker image
› Add a new "docker universe"
35
Docker Universe
  universe = docker
  executable = /bin/my_executable
  arguments = arg1
  docker_image = deb7_and_HEP_stack
  transfer_input_files = some_input
  output = out
  error = err
  log = log
  queue
36
Docker Universe
  universe = docker
  executable = /bin/my_executable
The executable comes either from the submit machine or from the image, NOT from the execute machine.
37
Docker Universe
  universe = docker
  executable = /bin/my_executable
  docker_image = deb7_and_HEP_stack
The image name refers to a docker image stored on the execute machine.
38
Docker Universe
  universe = docker
  executable = /bin/my_executable
  docker_image = deb7_and_HEP_stack
  transfer_input_files = some_input
HTCondor can transfer input files from the submit machine into the container (and, in reverse, output files back out).
39
Thank You!
Let's talk about your high throughput computing hopes, dreams, and nightmares. Thank you for your input!