Containing HTCondor Greg Thain INFN 2016.



3 Protections
  Protect the machine from the job.
  Protect the job from the machine.
  Protect one job from another.
  Actually impossible to do with pure POSIX, especially because a job is a bunch of processes.

The perfect box
  Allows nesting
  Need not require root
  Can’t be broken out of
  Portable to all OSes
  Allows full management:
    Creation / Destruction
    Monitoring / Limiting

A Job ain’t nothing but work
  Resources a job can (ab)use:
    CPU
    Memory
    Disk
    Signals
    Network

Previous Solutions
  HTCondor PREEMPT expression:
    TARGET.MemoryUsage > threshold
    ProportionalSetSizeKb > threshold
  setrlimit call:
    USER_JOB_WRAPPER
    STARTER_RLIMIT_AS
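
The setrlimit approach can be sketched in a few lines of Python. This is only an illustration of the kind of thing a wrapper in the spirit of USER_JOB_WRAPPER might do, not HTCondor's implementation. The helper name `run_with_address_space_cap` and the 1 GiB figure are assumptions made for the example.

```python
import resource
import subprocess
import sys


def run_with_address_space_cap(argv, max_bytes):
    """Run argv with its address space (RLIMIT_AS) capped. Hypothetical helper."""

    def apply_cap():
        # Runs in the child between fork and exec, so only the job is limited.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        # An unprivileged process may not exceed the existing hard limit.
        cap = max_bytes if hard == resource.RLIM_INFINITY else min(max_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (cap, cap))

    return subprocess.run(argv, preexec_fn=apply_cap, capture_output=True, text=True)


if __name__ == "__main__":
    show_limit = [
        sys.executable,
        "-c",
        "import resource; print(resource.getrlimit(resource.RLIMIT_AS)[0])",
    ]
    print(run_with_address_space_cap(show_limit, 1024 ** 3).stdout.strip())
```

The limit applies per process, which is exactly the weakness the talk points at: a job made of many processes can still exceed the intended total.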

CPU AFFINITY
  ASSIGN_CPU_AFFINITY = true
  Now works with dynamic slots
  Need not be root
  Any Linux version
  Only limits the job

The Big Hammer
  Some people see this problem, and say “I know, we’ll use a Virtual Machine”

Problems with VMs
  Might need hypervisor installed
    The right hypervisor (the right version…)
  Need to keep full OS image maintained
  Difficult to debug
  Hard to federate
  Just too heavyweight

Containers, not VMs
  Want opaque box
  Much LXC work applicable here
  Work with
  Best feature of HTCondor ever?

PID namespaces
  You can’t kill what you can’t see
  Requirements:
    HTCondor 7.9.4+
    RHEL 6
    USE_PID_NAMESPACES = true (off by default)
    Doesn’t work with privsep
    Must be root

PID Namespaces
  Init (1)
    Master (pid 15)
      Startd (pid 26)
        Starter (pid 73)
          Job (pid 1)
        Starter (pid 39)
          Job (pid 1)

Named Chroots
  “Lock the kids in their room”
  Startd advertises set:
    NAMED_CHROOT = /foo/R1,/foo/R2
  Job picks one:
    +RequestedChroot = "/foo/R1"
  Make sure path is secure!

Control Groups, aka “cgroups”
  Two basic kernel abstractions:
    1) nested groups of processes
    2) “controllers” which limit resources

Control Cgroup setup
  Implemented as filesystem
  Mounted on /sys/fs/cgroup, or /cgroup or …
  /proc/self/cgroup lists the cgroups the current process belongs to

Cgroup controllers
  Cpu
  Memory
  Freezer

Enabling cgroups
  Requires:
    RHEL 6 (RHEL 7 even better)
    HTCondor 7.9.5+
    Rootly condor
    BASE_CGROUP = htcondor
    And… cgroup fs mounted…

Cgroups with HTCondor
  Starter puts each job into own cgroup
    Named exec_dir + job id
  Procd monitors
  MEMORY attr into memory controller
  CGROUP_MEMORY_LIMIT_POLICY
    Hard or soft or none
  Job goes on hold with specific message

Cgroup artifacts
  StarterLog:
    04/22/13 11:39:08 Requesting cgroup htcondor/condor_exec_slot1@localhost for job …
  ProcLog:
    … cgroup to htcondor/condor_exec_slot1@localhost for ProcFamily 2727.
    04/22/13 11:39:13 : PROC_FAMILY_GET_USAGE
    04/22/13 11:39:13 : gathering usage data for family with root pid 2724
    04/22/13 11:39:17 : PROC_FAMILY_GET_USAGE
    04/22/13 11:39:17 : gathering usage …

$ condor_q
-- Submitter: localhost : <127.0.0.1:58873> : localhost
 ID      OWNER      SUBMITTED     RUN_TIME ST PRI SIZE CMD
 2.0     gthain    4/22 11:36   0+00:00:02 R  0   0.0  sleep 3600
$ ps ax | grep 3600
gthain 2727  4268 4880 condor_exec.exe 3600

A process with Cgroups
  $ cat /proc/2727/cgroup
  3:freezer:/htcondor/condor_exec_slot1@localhost
  2:memory:/htcondor/condor_exec_slot1@localhost
  1:cpuacct,cpu:/htcondor/condor_exec_slot1@localhost
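
The /proc/<pid>/cgroup listing above is easy to read mechanically. The following is a minimal sketch, assuming the cgroup v1 line format shown on the slide (hierarchy id, comma-separated controllers, path). The function name `parse_proc_cgroup` is made up for this example.

```python
def parse_proc_cgroup(text):
    """Map each controller name to its cgroup path (cgroup v1 line format)."""
    membership = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is "hierarchy-id:controller-list:path".
        # Split at most twice so a ':' inside the path survives.
        _hierarchy_id, controllers, path = line.split(":", 2)
        for controller in controllers.split(","):
            membership[controller] = path
    return membership


# Sample copied from the slide.
SAMPLE = """\
3:freezer:/htcondor/condor_exec_slot1@localhost
2:memory:/htcondor/condor_exec_slot1@localhost
1:cpuacct,cpu:/htcondor/condor_exec_slot1@localhost
"""

if __name__ == "__main__":
    for controller, path in sorted(parse_proc_cgroup(SAMPLE).items()):
        print(controller, path)
```

On a live machine the same function could be fed the contents of /proc/self/cgroup. On a cgroup v2 host the controller list is empty, so the result has a single entry keyed by the empty string.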

$ cd /sys/fs/cgroup/memory/htcondor/condor_exec_slot1@localhost/
$ cat memory.usage_in_bytes
258048
$ cat tasks
2727
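
The same two files can be read programmatically. This sketch assumes the cgroup v1 layout used on the slide (a `memory` controller directory under the mount point, containing `memory.usage_in_bytes` and `tasks`). The function name and the `mount` parameter are illustrative, and the demo builds a throwaway directory tree rather than touching the real /sys/fs/cgroup.

```python
import tempfile
from pathlib import Path


def cgroup_memory_usage(cgroup_path, mount="/sys/fs/cgroup"):
    """Return (usage_in_bytes, pids) for one cgroup v1 memory cgroup."""
    base = Path(mount) / "memory" / cgroup_path.lstrip("/")
    usage = int((base / "memory.usage_in_bytes").read_text().strip())
    pids = [int(pid) for pid in (base / "tasks").read_text().split()]
    return usage, pids


if __name__ == "__main__":
    # Fake cgroup tree with the values from the slide, for demonstration only.
    with tempfile.TemporaryDirectory() as fake_mount:
        job_dir = Path(fake_mount) / "memory" / "htcondor" / "condor_exec_slot1@localhost"
        job_dir.mkdir(parents=True)
        (job_dir / "memory.usage_in_bytes").write_text("258048\n")
        (job_dir / "tasks").write_text("2727\n")
        print(cgroup_memory_usage("/htcondor/condor_exec_slot1@localhost", mount=fake_mount))
```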

MOUNT_UNDER_SCRATCH
  Goal: protect /tmp from shared jobs
  Requires:
    Condor 7.9.4+
    RHEL 5
    HTCondor must be running as root
  MOUNT_UNDER_SCRATCH = /tmp,/var/tmp

MOUNT_UNDER_SCRATCH
  MOUNT_UNDER_SCRATCH = /tmp,/var/tmp
  Each job sees private /tmp, /var/tmp
  Downsides:
    No sharing of files in /tmp

Ancient History: Chroot
  HTCondor used to chroot every job:
    No job could touch the file system
    Private files in host machine stayed private

Chroot: more trouble than value
  Increasingly difficult to work:
    Shared libraries
    /dev
    /sys
    /etc
    /var/run pipes for syslog, etc.
  How to create root filesystem?
    Easier now with yum, apt-get, etc., but still hard:

Repos make images Easier*
  $ dnf -y --releasever=21 --nogpg --installroot=/srv/mycontainer --disablerepo='*' --enablerepo=fedora install systemd passwd dnf fedora-release vim-minimal
  $ debootstrap --arch=amd64 unstable ~/debian-tree/
  $ pacstrap -c -d ~/arch-tree/ base

We gave up!
  HTCondor no longer chroots all jobs
  But you can optionally do so
  Very few sites do…
  NAMED_CHROOT = /foo

Enter Docker!

This is Docker
  Docker manages Linux containers.
  Containers give Linux processes a private:
    Root file system
    Process space
    NATed network
    UID space

Examples
  This is my host OS, running Fedora
  This is an “ubuntu” container
  Processes in other containers on this machine can NOT see what’s going on in this “ubuntu” container

At the Command Line
  $ hostname
  whale
  $ cat /etc/redhat-release
  Fedora release 20 (Heisenbug)
  $ docker run ubuntu cat /etc/debian_version
  jessie/sid
  $ time docker run ubuntu sleep 0
  real 0m1.825s
  user 0m0.017s
  sys 0m0.024s

More CLI detail
  $ docker run ubuntu cat /etc/debian_version
  All docker commands are bound into the “docker” executable
  “run” command runs a process in a container
  “ubuntu” is the base filesystem for the container, an “image”
  “cat” is the Unix process, from the image, that we will run (followed by the arguments)
  “How do images get made?” More on that later.

Images
  Images provide the user-level filesystem
  Doesn’t contain the Linux kernel
    Or device drivers
    Or swap space
  Very small: ubuntu: 200 MB
  Images are READ ONLY

Docker run two step
  Every image that docker runs must be local
  How to get:
    $ docker search image-name
    $ docker pull image-name
  Docker run implies pull first!
    run can fail if image doesn’t exist or is unreachable

Where images come from
  Docker, Inc. provides a public-access hub
  Contains 10,000+ publicly usable images behind a CDN
  What’s local?
    $ docker images
    REPOSITORY      TAG     IMAGE ID      CREATED       VIRTUAL SIZE
    new_ubu         latest  b67902967df7  8 weeks ago   192.7 MB
    <none>          <none>  dd58b0ec6b9a  8 weeks ago   192.7 MB
    <none>          <none>  1d19dc9e2e4f  8 weeks ago   192.7 MB
    rocker/rstudio  latest  14fad19147b6  8 weeks ago   787 MB
    ubuntu          latest  d0955f21bf24  8 weeks ago   192.7 MB
    busybox         latest  4986bf8c1536  4 months ago  2.433 MB

Image name
  hub.demo.org:8080/user/image:ver
  Name of the hub (default: docker-io)
  Name of the user (default: system user)
  Image name and version (default: “latest”)
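
The naming scheme on this slide can be illustrated with a small parser. This is a simplified sketch, not Docker's own reference parser. It relies on Docker's convention that the first component names a hub only when it looks like a host (it contains a dot or a colon, or is `localhost`). The function name `parse_image_name` and the choice of `None` for an unspecified user are assumptions of the example.

```python
def parse_image_name(name, default_hub="docker-io", default_tag="latest"):
    """Split 'hub/user/image:ver' into its parts, filling in the slide's defaults."""
    parts = name.split("/")
    hub = default_hub
    # A leading component is a hub only if it looks like a host name.
    if len(parts) > 1 and ("." in parts[0] or ":" in parts[0] or parts[0] == "localhost"):
        hub = parts.pop(0)
    # The version, if present, follows the last ':' of the final component.
    image, _sep, tag = parts.pop().partition(":")
    user = "/".join(parts) if parts else None
    return {"hub": hub, "user": user, "image": image, "tag": tag or default_tag}


if __name__ == "__main__":
    for example in ("hub.demo.org:8080/user/image:ver", "rocker/rstudio", "ubuntu"):
        print(example, "->", parse_image_name(example))
```

Checking for a host-like first component is what lets the port colon in `hub.demo.org:8080` coexist with the version colon in `image:ver`.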

Wait! I don’t want my images public!
  Easy to make your own images (from tar files)
  The docker hub is open source
    Straightforward to start your own
  How is it distributed?

Docker hub is an image!
  $ docker run docker/docker-registry
  (and a bunch of setup; google for details)
  Any production site will want to run its own hub
  Or put a caching proxy in front of the public one

Under the hood of images
  Images are composed of layers
  Images can share base layers:
    ubuntu: 200 MB
    ubuntu + R: 250 MB
    ubuntu + matlab: 250 MB
    All three: 300 MB
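
The arithmetic behind "All three: 300 MB" is a union over layers: a shared base layer is stored once. The sketch below reproduces the slide's numbers. The layer names (`ubuntu-base`, `r-layer`, `matlab-layer`) and the exact split into layers are invented for illustration.

```python
# Hypothetical layer sizes in MB, chosen to match the slide's totals.
LAYER_SIZES_MB = {"ubuntu-base": 200, "r-layer": 50, "matlab-layer": 50}

# Each image is an ordered stack of layers; the base layer is shared.
IMAGES = {
    "ubuntu": ["ubuntu-base"],
    "ubuntu + R": ["ubuntu-base", "r-layer"],
    "ubuntu + matlab": ["ubuntu-base", "matlab-layer"],
}


def image_size_mb(image):
    """Size of one image viewed on its own: the sum of all its layers."""
    return sum(LAYER_SIZES_MB[layer] for layer in IMAGES[image])


def disk_usage_mb(images):
    """Disk actually used by a set of images: each distinct layer counted once."""
    distinct_layers = set()
    for image in images:
        distinct_layers.update(IMAGES[image])
    return sum(LAYER_SIZES_MB[layer] for layer in distinct_layers)


if __name__ == "__main__":
    for image in IMAGES:
        print(image, image_size_mb(image), "MB")
    print("all three together:", disk_usage_mb(IMAGES), "MB")
```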

Container vs. Image
  Image is like Unix program on disk
    read only, static
  Container is like Unix process
  Docker run starts a container from an image
  Container states, like a condor job:
    Running
    Stopped

Containers
  $ docker ps shows running containers:
    CONTAINER ID  IMAGE          COMMAND     NAMES
    b71fff77e7b9  ubuntu:latest  /bin/sleep  owly_tannenba
  $ docker ps -a also shows stopped containers:
    7eff0a4dd0b4  debian:jessie  /bin/sleep  owly_tannenba

Operations on Containers
  $ docker ps -a
  $ docker run …
  $ docker stop containerId
  $ docker restart containerId
  $ docker rm containerId

Where is my output?
  $ docker diff containerId
  $ sudo docker diff 7bbb
  C /dev
  A /dev/kmsg
  C /etc
  D /foo
  $ docker cp containerId:/path /host
  Works on running or stopped containers
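
The `docker diff` listing uses one-letter codes: A for added, C for changed, D for deleted. A short sketch of grouping that output follows; the function name `parse_docker_diff` is made up, and the sample is the output shown on the slide.

```python
def parse_docker_diff(output):
    """Group `docker diff` lines by change kind (A=added, C=changed, D=deleted)."""
    kinds = {"A": "added", "C": "changed", "D": "deleted"}
    changes = {"added": [], "changed": [], "deleted": []}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split once so a path containing spaces stays in one piece.
        code, path = line.split(None, 1)
        changes[kinds[code]].append(path)
    return changes


# Sample copied from the slide.
SAMPLE_DIFF = """\
C /dev
A /dev/kmsg
C /etc
D /foo
"""

if __name__ == "__main__":
    print(parse_docker_diff(SAMPLE_DIFF))
```

The paths under "added" are the candidates for `docker cp` when you want a job's output back out of a container.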

Or, use “volumes”
  $ docker run -v /host:/container …
  Volume is a directory that isn’t mapped
  Output to volume goes directly to host
  Fast: just a local mount

Why should you care?
  Reproducibility
    How many .so’s in /usr/lib64 do you use?
    Will a RHEL 6 app run on RHEL 9 in five years?
  Packaging
    Image is a great way to package large software stacks
  Ease of inspection and management
  Imagine an OSG with container support!

I Know What You Are Thinking!

Isn’t this a Virtual Machine?
  Containers share Linux kernel with host
  Host can “ps” into container
    One-way mirror, not black box
  Docker provides namespace for images
  Docker containers do not run system daemons
    CUPS, email, cron, init, fsck (think about security!)
  Docker images much smaller than VM ones
    Just a set of files, not a disk image
  Much more likely to be universally available

Semantics: VM vs. Container
  VMs provide ONE operation:
    Boot the black box
    Run until poweroff
  Containers provide process interface:
    Start this process within the container
    Run until that process exits
    Much more Condor-like

Docker and HTCondor New “docker universe” (not actually new universe id)

Installation of Docker universe
  Need docker (maybe from EPEL)
    $ yum install docker-io
  Condor needs to be in the docker group!
    $ usermod -aG docker condor
    $ service docker start

What? No Knobs?
  Default install should require no condor knobs!
  But we have them anyway:
    DOCKER = /usr/bin/docker

Condor startd detects docker
  $ condor_status -l | grep -i docker
  HasDocker = true
  DockerVersion = "Docker version 1.5.0, build a8a31ef/1.5.0"
  $ condor_status -const HasDocker
  Check StarterLog for error messages

Docker Universe
  universe = docker
  executable = /bin/my_executable
  arguments = arg1
  docker_image = deb7_and_HEP_stack
  transfer_input_files = some_input
  output = out
  error = err
  log = log
  queue
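
Submit files like this one are plain text, so they are easy to generate when many similar jobs are needed. The sketch below only builds the text; it does not call condor_submit. The function `render_docker_submit` and its parameter names are hypothetical, while the submit commands it emits are the ones shown on the slide.

```python
def render_docker_submit(image, executable=None, arguments=None, input_files=(),
                         output="out", error="err", log="log"):
    """Return the text of a docker-universe submit file (hypothetical helper)."""
    lines = ["universe = docker", f"docker_image = {image}"]
    # The executable is optional: an image can name a default command.
    if executable:
        lines.append(f"executable = {executable}")
    if arguments:
        lines.append(f"arguments = {arguments}")
    if input_files:
        lines.append("transfer_input_files = " + ", ".join(input_files))
    lines += [f"output = {output}", f"error = {error}", f"log = {log}", "queue"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_docker_submit(
        "deb7_and_HEP_stack",
        executable="/bin/my_executable",
        arguments="arg1",
        input_files=["some_input"],
    ), end="")
```

The resulting text could be written to a file and handed to condor_submit in the usual way.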

Docker Universe Job Is still a job
  Docker containers have the job-nature
    condor_submit
    condor_rm
    condor_hold
    Write entries to the user log event log
    condor_dagman works with them
    Policy expressions work
    Matchmaking works
    User prio / job prio / group quotas all work
    Stdin, stdout, stderr work
    Etc. etc. etc.*

Docker Universe
  universe = docker
  executable = /bin/my_executable
  Executable comes either from submit machine or image
  NEVER FROM execute machine!

Docker Universe
  universe = docker
  # executable = /bin/my_executable
  Executable can even be omitted!
    trivia: true for what other universe?
    (Images can name a default command)

Docker Universe
  universe = docker
  executable = ./my_executable
  transfer_input_files = my_executable
  If executable is transferred,
    Executable copied from submit machine
    (useful for scripts)

Docker Universe
  universe = docker
  executable = /bin/my_executable
  docker_image = deb7_and_HEP_stack
  Image is the name of the docker image stored on execute machine. Condor will fetch it if needed.

Docker Universe
  universe = docker
  transfer_input_files = some_input
  HTCondor can transfer input files from submit machine into container
  (same with output in reverse)

Condor’s use of Docker
  Condor volume mounts the scratch dir
  Condor sets the cwd of job to the scratch dir
  Can’t see NFS mounted filesystems!
  Condor runs the job with the usual uid rules.
  Sets container name to HTCJob_$(CLUSTER)_$(PROC)_slotName

Scratch dir == Volume
  Means normal file xfer rules apply
    transfer in, transfer out
    subdirectory rule holds
    condor_tail works
  RequestDisk applies to scratch dir, not container
  Any changes to the container are not xfered
  Container is removed on job exit

Docker Resource limiting
  RequestCpus = 4
  RequestMemory = 1024M
  RequestDisk = Somewhat ignored…
  RequestCpus translated into cgroup shares
  RequestMemory enforced
    If exceeded, job gets OOM killed
    job goes on hold
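
To make the mapping concrete, here is a sketch of how a job's requests could be turned into a `docker run` command line. It is not HTCondor's actual invocation, which the talk does not show. The options used (`--name`, `--volume`, `--workdir`, `--cpu-shares`, `--memory`) are real `docker run` options. The function name, the parameters, and especially the factor of 1024 shares per requested CPU are assumptions made for illustration.

```python
def docker_run_argv(image, command, cluster, proc, slot, scratch_dir,
                    request_cpus, request_memory_mb, shares_per_cpu=1024):
    """Build an illustrative `docker run` argument list for one job."""
    # Container name pattern taken from the talk: HTCJob_$(CLUSTER)_$(PROC)_slotName
    name = f"HTCJob_{cluster}_{proc}_{slot}"
    return [
        "docker", "run",
        "--name", name,
        # The scratch dir is volume-mounted and becomes the job's cwd.
        "--volume", f"{scratch_dir}:{scratch_dir}",
        "--workdir", scratch_dir,
        # CPU is a relative weight (cgroup shares); memory is a hard limit.
        "--cpu-shares", str(request_cpus * shares_per_cpu),  # factor is assumed
        "--memory", f"{request_memory_mb}m",
        image, *command,
    ]


if __name__ == "__main__":
    argv = docker_run_argv("deb7_and_HEP_stack", ["/bin/my_executable", "arg1"],
                           cluster=286, proc=0, slot="slot1",
                           scratch_dir="/var/lib/condor/execute/dir_1234",
                           request_cpus=4, request_memory_mb=1024)
    print(" ".join(argv))
```

Shares only matter under contention, which is why the CPU request behaves as a soft guarantee while the memory request is enforced outright. The scratch path in the demo is a placeholder.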

Why is my job on hold?
  Docker couldn’t find image name:
    $ condor_q -hold
    -- Submitter: localhost : <127.0.0.1:49411?addrs=127.0.0.1:49411> : localhost
     ID      OWNER     HELD_SINCE  HOLD_REASON
     286.0   gthain    5/10 10:13  Error from slot1@localhost: Cannot start container: invalid image name: debain
  Exceeded memory limit?
    Just like vanilla job with cgroups

Coming soon…
  Advertise images we already have
  Garbage collection of used images
  Report resource usage

Potential Features?
  Network support? Better than NAT? LARKy?
  Support for shared filesystems?
  Mapping other directories to containers?
  Get entire container diff back?
  Run containers as root?
  Mount fake /proc/cpuinfo and friends
  condor_ssh_to_job
  Automatic checkpoint and restart of containers!

Custom Volume Mounts
  Admin-specified:
    DOCKER_VOLUMES = A, B
    DOCKER_VOLUME_DIR_A = /path1
    DOCKER_VOLUME_DIR_B = /path2:ro
    DOCKER_MOUNT_VOLUMES = A, B
  HasDockerVolumesA = true
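
One way to read these knobs: DOCKER_VOLUMES names the volumes, and each DOCKER_VOLUME_DIR_<name> gives the path. The sketch below turns such settings into structured records. Treating a trailing `:ro` as "read-only" follows Docker's volume convention and is an assumption here, as are the function name and the use of a plain dict in place of real HTCondor configuration lookup.

```python
def parse_docker_volumes(config):
    """Turn DOCKER_VOLUMES / DOCKER_VOLUME_DIR_* settings into records."""
    names = [n.strip() for n in config.get("DOCKER_VOLUMES", "").split(",") if n.strip()]
    volumes = []
    for name in names:
        spec = config[f"DOCKER_VOLUME_DIR_{name}"]
        # Assumed convention: a ':ro' suffix marks the mount read-only.
        read_only = spec.endswith(":ro")
        path = spec[: -len(":ro")] if read_only else spec
        volumes.append({"name": name, "path": path, "read_only": read_only})
    return volumes


# Settings copied from the slide, held in a dict for the example.
SLIDE_CONFIG = {
    "DOCKER_VOLUMES": "A, B",
    "DOCKER_VOLUME_DIR_A": "/path1",
    "DOCKER_VOLUME_DIR_B": "/path2:ro",
    "DOCKER_MOUNT_VOLUMES": "A, B",
}

if __name__ == "__main__":
    for volume in parse_docker_volumes(SLIDE_CONFIG):
        print(volume)
```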

Surprises with Docker
  Moving fast: bugs added/removed quickly
  10 GB limit on container growth
  Security concerns
  Adding new hub requires SSL cert on client
  Containers don’t nest by default
  Docker needs root: problem for glidein
  No support for Windows/Mac/BSD or other
  Everyone shares a Linux kernel

The “init” problem
  Or, “How come my docker job isn’t exiting?”
  Docker process runs as pid 1 in pid namespace
  Linux blocks all unhandled catchable signals
  Soft kills usually don’t work
  shell wrapper fixes
  condor_pid_ns_init
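
The wrapper idea can be sketched as follows: a small process sits at pid 1, installs explicit handlers for the soft-kill signals, and forwards them to the real job. This is not condor_pid_ns_init, just an illustration of the technique. The function name `run_as_init` and the choice of signals are assumptions. A real init would also reap orphaned grandchildren, which this sketch leaves out.

```python
import signal
import subprocess
import sys


def run_as_init(argv):
    """Run argv as a child, forwarding soft-kill signals to it; return its exit code."""
    child = subprocess.Popen(argv)

    def forward(signum, _frame):
        # Because this handler is installed, the kernel delivers the signal
        # to us even when we are pid 1; we pass it on to the job.
        if child.poll() is None:
            child.send_signal(signum)

    forwarded = (signal.SIGTERM, signal.SIGINT, signal.SIGHUP)
    previous = {signum: signal.signal(signum, forward) for signum in forwarded}
    try:
        return child.wait()
    finally:
        # Put back whatever handlers were in place before.
        for signum, handler in previous.items():
            if handler is not None:
                signal.signal(signum, handler)


if __name__ == "__main__":
    code = run_as_init([sys.executable, "-c", "import sys; sys.exit(7)"])
    print("job exited with", code)
```

The job itself then runs as an ordinary pid, where default signal dispositions apply and a soft kill terminates it as expected.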

Alternatives to docker
  CoreOS: Rocket (rkt)
  Systemd: nspawn

Summary
  Containment isn’t necessarily containers
  Docker universe runs containers like jobs
  Could be game-changing
  Very interested in user feedback