1
HTCondor / HEP Partnership and Activities
HEPiX Fall 2014
Todd Tannenbaum
Center for High Throughput Computing, Department of Computer Sciences, University of Wisconsin-Madison
2
University of Wisconsin Center for High Throughput Computing
3
HTCondor
› Open source distributed high throughput computing
› Management of resources, jobs, and workflows
› Primary objective: assist the scientific community with their high throughput computing needs
› Mature technology…
4
[Timeline figure, 1978–1993: Enslow's DPS paper, Miron Livny's PhD, Condor deployed]
5
Mature… but actively developed
› Last year: 96 new enhancements, 2100+ commits, 8 releases, 39k regression test runs
› Open source development model
› Evolve to meet the needs of the science community in an ever-changing computing landscape
6
BNL (US-ATLAS T1) and FNAL (US-CMS T1) adopted Condor in the early days of PPDG (~2000). Rutherford Appleton Laboratory (UK T1) adopted HTCondor in 2013.
7
CERN (T0) HTCondor activity in 2014
8
Why am I here?
We want to work together with the HEP community to leverage our collective experience, effort, and know-how to offer an open source solution that meets the growing needs of HEP high throughput computing in a challenging budget environment.
Please talk to one of these handsome fellas!
9
Current Channels
› Documentation
› Community support email list (htcondor-users)
› Ticket-tracked developer support
› Fully open development model
› Commercial options for 24/7 support
› Bi-weekly/monthly phone conferences with CMS, LIGO, IceCube, LSST, FNAL, iPlant, …: identify and track current problems, communicate and plan future goals, identify and collaborate on challenges (f2f)
10
HTCondor Week
› Each year in early May in Madison, WI
› Perhaps other locales/focuses?
11
HTCondor R&D heavily influenced by HEP requirements
› Now that we see more HTCondor pools in HEP/LHC, "inter-HTCondor" functionality is more important:
  Distributed scheduling policy: no reliance on shared UIDs or file servers; file-movement scheduling, with overlap of file stage-out and computation
  Networking: can traverse firewalls, NATs, and private nets via CCB ("Condor Connection Broker")
  Federation technologies: flocking, glidein via grid universe jobs, HTCondor CE
13
[Figure: a personal Condor on your workstation, with 600 Condor jobs, using its local Condor pool and a friendly Condor pool]
14
How Flocking Works
› Add a line to your condor_config:
  FLOCK_HOSTS = Pool-Foo, Pool-Bar
[Figure: the submit machine's schedd talks to the collector and negotiator on its own central manager (CONDOR_HOST) and to the central managers of Pool-Foo and Pool-Bar]
15
HTCondor Flocking
› Used by many sites in OSG to easily federate all institutional clusters; fast and easy to set up (see the configuration sketch below)
› Remote pools are contacted in the order specified until jobs are satisfied
› The user-priority system is "flocking-aware": a pool's local users can have priority over remote users "flocking" in.
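As a concrete illustration, here is a minimal flocking sketch using the FLOCK_TO / FLOCK_FROM knobs documented in the HTCondor manual (the slide above uses the shorthand FLOCK_HOSTS); the host names are made-up placeholders:

  # condor_config on the submit machine: when local resources are exhausted,
  # try these pools, in the order listed
  FLOCK_TO = cm.pool-foo.example, cm.pool-bar.example

  # condor_config on Pool-Foo's central manager: accept flocked jobs
  # from that submit machine
  FLOCK_FROM = submit.chtc.example
  ALLOW_WRITE = $(ALLOW_WRITE), $(FLOCK_FROM)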
16
Glidein via Grid Universe
› Reliable, durable submission of a job to a remote scheduler to build a batch-scheduler overlay network (submit-file sketch below)
› Supports many "back end" types: Globus (GT2, GT5), NorduGrid, UNICORE, HTCondor, PBS, LSF, SGE, EC2, OpenStack, Deltacloud, CREAM, SSH, BOINC
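A hedged sketch of such a grid-universe submit file, sending a pilot/glidein startup job to a remote HTCondor schedd (the "HTCondor" back end above); the host names and the glidein_startup.sh script are placeholders, not taken from the talk:

  universe = grid
  grid_resource = condor remote-schedd.example.org remote-cm.example.org
  executable = glidein_startup.sh
  output = glidein.out
  error = glidein.err
  log = glidein.log
  queue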
17
Another current collaboration: the CMS "global pool" project
› Dynamically created via glideinWMS from grid, cloud, and HPC resources
› 200,000 cores, 400,000 jobs queued
› 10 submit nodes (5 for production, 5 for analysis), using DAGMan to manage workflows
› Execute nodes will run behind firewalls and NATs, necessitating CCB use (sketch below); we will also want to minimize the number and frequency of TCP connections
› We expect at least CERN to provide worker nodes with outbound IPv6 but without outbound IPv4 networking
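A minimal sketch of the execute-node configuration that enables CCB, assuming the pool collector doubles as the broker (the private-network name is a made-up placeholder):

  # condor_config on a worker node behind a firewall/NAT: register with the
  # pool collector's CCB so the schedd can reach us over our outbound connection
  CCB_ADDRESS = $(COLLECTOR_HOST)
  # Hosts sharing this private network can still connect to us directly
  PRIVATE_NETWORK_NAME = cms-site.example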
18
Scalability Work
› Test, measure, enhance, repeat!
› Enhanced protocols to lower latency and connection counts, more non-blocking I/O, statistics
› Currently at ~155,000 jobs running reliably across 8 submit machines
19
IPv6 Work: Mixed-mode Flocking
› Two pools: IPv4-only and IPv6-only
› A dual-stack (IPv4 & IPv6) HTCondor submit node (schedd) can participate in both pools (configuration sketch below)
› The schedd appears single-protocol to each pool: it rewrites advertised addresses to match the protocol of the active stream
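A sketch of what the dual-stack submit node's configuration might look like; the ENABLE_IPV4/ENABLE_IPV6 knobs come from later HTCondor releases and the pool names are placeholders, so treat this as an assumption rather than the exact setup behind this work:

  # condor_config on the dual-stack schedd machine
  ENABLE_IPV4 = TRUE
  ENABLE_IPV6 = TRUE
  # Flock to one IPv4-only pool and one IPv6-only pool; advertised addresses
  # are rewritten to match the protocol of the active stream
  FLOCK_TO = cm-v4.example.org, cm-v6.example.org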
20
Automated IPv6 testing
› Today: our test suite exclusively uses IPv4
› Goals: existing tests run twice, once using IPv4 and once using IPv6 (excepting things we don't plan on supporting, e.g. the standard universe); new tests for mixed-mode
21
HTCondor Linux containers support
Power to the admin! Tame those jobs!
22
Containers in HTCondor
› HTCondor can currently leverage Linux containers / cgroups to run jobs (configuration sketch below):
  Limiting/monitoring CPU core usage
  Limiting/monitoring physical RAM usage
  Tracking all subprocesses
  Private file namespace (each job can have its own /tmp!)
  Private PID namespace
  Chroot jail
  Private network namespace (soon! each job can have its own network address)
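A hedged condor_config sketch for an execute node, using knobs documented in the HTCondor manual, that turns on several of the features listed above:

  # Place each job in its own cgroup under this base, so CPU and memory use
  # of the whole process tree can be tracked and limited
  BASE_CGROUP = htcondor
  CGROUP_MEMORY_LIMIT_POLICY = hard
  # Give each job a private PID namespace and private /tmp and /var/tmp
  USE_PID_NAMESPACES = true
  MOUNT_UNDER_SCRATCH = /tmp, /var/tmp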
23
More containers… Docker and HTCondor
24
This is Docker
Docker manages Linux containers. Containers give Linux processes a private:
  Root file system
  Process space
  NATed network
25
Examples
This is an "ubuntu" container; this is my host OS, running Fedora. Processes in other containers on this machine can NOT see what's going on in this "ubuntu" container.
26
Command line example
  $ docker run ubuntu cat /etc/debian_version
All docker commands are bound into the "docker" executable. The "run" command runs a process in a container. "ubuntu" is the base filesystem for the container, an "image". "cat" is the Unix process we will run from the image (followed by its arguments).
27
At the Command Line
  $ hostname
  whale
  $ cat /etc/redhat-release
  Fedora release 20 (Heisenbug)
  $ docker run ubuntu cat /etc/debian_version
  jessie/sid
  $ time docker run ubuntu sleep 0
  real    0m1.825s
  user    0m0.017s
  sys     0m0.024s
28
Images
Images provide the user-level filesystem: everything but the Linux kernel. You can make your own; the Docker hub provides many standard ones, and docker can pull images from the hub.
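For instance, pulling a standard image from the hub and listing what is cached locally (stock docker commands, shown here for illustration):

  $ docker pull ubuntu    # fetch the standard "ubuntu" image from the Docker hub
  $ docker images         # list the images now available on this host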
29
Images are copy-on-write
All changes are written to the top-level layer. Changes can be pulled out of the container after it exits.
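One hedged way to do that with stock docker commands (the container name, file path, and command are made up for illustration):

  $ docker run --name myjob ubuntu touch /results.txt   # job writes into the image's top layer
  $ docker diff myjob                                   # list files the job added or changed
  $ docker cp myjob:/results.txt ./results.txt          # copy the output out after the container exits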
30
Why should you care?
› Reproducibility: how many .so's in /usr/lib64 do you use? Will a RHEL 6 app run on RHEL 9 in five years?
› Packaging: an image is a great way to package large software stacks
› Imagine an OSG with container support!
31
I Know What You Are Thinking!
32
Isn't this a Virtual Machine?
› Containers share the Linux kernel with the host
› The host can "ps" into a container: a one-way mirror, not a black box
› Docker provides a namespace for images
› Docker containers do not run system daemons: CUPS, email, cron, init, fsck, … (think about security!)
› Docker images are much smaller than VM images: just a set of files, not a disk image
› Much more likely to be universally available
33
Semantics: VM vs. Container
› VMs provide ONE operation: boot the black box, run until poweroff
› Containers provide a process-like interface: start this process within the container, run until that process exits
› Much more HTCondor-like
34
Docker and HTCondor
› Package HTCondor as a docker image
› Add a new "docker universe"
35
Docker Universe
  universe = docker
  executable = /bin/my_executable
  arguments = arg1
  docker_image = deb7_and_HEP_stack
  transfer_input_files = some_input
  output = out
  error = err
  log = log
  queue
36
Docker Universe
  universe = docker
  executable = /bin/my_executable
The executable comes either from the submit machine or from the image, NOT from the execute machine.
37
Docker Universe
  universe = docker
  executable = /bin/my_executable
  docker_image = deb7_and_HEP_stack
The image name refers to a docker image stored on the execute machine.
38
Docker Universe
  universe = docker
  executable = /bin/my_executable
  docker_image = deb7_and_HEP_stack
  transfer_input_files = some_input
HTCondor can transfer input files from the submit machine into the container (and, in reverse, output files back out).
39
Thank You!
Let's talk about your high throughput computing hopes, dreams, and nightmares. Thank you for your input!