Containers and other technology at the Tier-1

1 Containers and other technology at the Tier-1
Andrew Lahiff 7th April 2017, GridPP38 - Sussex

2 Overview
- Worker nodes & containers at RAL: past, present & future
- Container cluster managers
- Other technology

3 Introduction
Long history of trying to isolate jobs in our batch system:
- protect the machine from jobs
- protect the jobs from the machine
- protect one job from another
Back in the days of Torque, things were very simple:
- jobs run with different uids
- jobs which used too much memory were killed
- jobs which used too much CPU or wall time were killed

4 Introduction
Since migrating to HTCondor, Linux kernel functionality has improved our ability to isolate jobs:
- cgroups for resource limits & monitoring (CPU, memory, ...), ensuring processes can't escape the batch system
- PID namespaces: processes in a job can't see any other processes on the host
- mount namespaces: /tmp and /var/tmp inside each job are unique
This has (mostly) worked well.
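As a rough illustration, the isolation features above map onto a handful of HTCondor configuration knobs; this fragment is a sketch, not our actual worker node configuration:

```
# Illustrative HTCondor worker node config for job isolation
BASE_CGROUP = htcondor                 # place each job's processes in a cgroup
CGROUP_MEMORY_LIMIT_POLICY = hard      # enforce the job's memory request
USE_PID_NAMESPACES = True              # jobs cannot see other processes on the host
MOUNT_UNDER_SCRATCH = /tmp,/var/tmp    # private /tmp & /var/tmp per job
```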

5 Introduction
Limitation: all jobs use the same root filesystem as the host:
- this strongly ties the jobs to the host OS
- an SL6 host can only run SL6 jobs
- software/OS dictated by LHC experiments
Possible solution? HTCondor's named chroot functionality:
- specify a directory containing an alternative root filesystem
- problem: difficult to create the environments; never really took off
- successfully tested at RAL with CMS jobs in early 2015 (SL6 jobs running on SL7 machines)

6 HTCondor Docker universe
By default HTCondor checks whether Docker is installed; HTCondor runs each job in a Docker container.
History:
- introduced in HTCondor in June 2015
- successfully ran LHC jobs at RAL in 2015 (jobs in SL6 containers on SL7 worker nodes)
- lots of bug fixes & improvements made
- Nebraska Tier 2 migrated fully to the Docker universe in summer 2016
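A Docker universe job is selected in the submit file; this is a minimal sketch (the image and file names are examples):

```
# Illustrative HTCondor submit file for the Docker universe
universe     = docker
docker_image = centos:6
executable   = job.sh
output       = job.out
error        = job.err
log          = job.log
queue
```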

7 HTCondor Docker universe
- containers run as pool account users, not root
- users don't have access to the Docker daemon at all
- no way for users to specify arbitrary images via the Grid
[Diagram: worker node running HTCondor, with each job in its own container managed by the Docker engine]

8 HTCondor Docker universe
Running jobs in containers on worker nodes is now much more important for us:
- Echo: in order to get the best performance we want to run an xrootd gateway on every worker node
- this requires SL7 on worker nodes now

9 Worker nodes
First step: move to SL7 but with as few changes as possible:
- many things (CVMFS, config) bind-mounted into the containers
[Diagram: SL6 worker node alongside an SL7 worker node running a CentOS 6 image; HTCondor machine/job features, CVMFS, /etc/grid-security, /etc/arc, /etc/<vo>, glexec, grid config files & RPMs (VO dependencies) are present on the SL6 node and bind-mounted or installed into the container on the SL7 node]

10 CVMFS
There are a few options, e.g.:
- static CVMFS mounts, bind mounted from the host
- CERN's CVMFS Docker volume plugin
- autofs, bind mounted from the host using shared mount propagation
We're using autofs:
- for multi-VO sites this seems the most sensible choice
- discovered a problem when restarting autofs on SL7; a workaround is available & is included in CVMFS 2.3.5
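The autofs approach above can be wired into the Docker universe via HTCondor's volume knobs; a sketch, assuming the host mounts CVMFS under /cvmfs:

```
# Illustrative HTCondor config: bind-mount the host's autofs-managed /cvmfs
# into every Docker universe job, with shared mount propagation so new
# repositories mounted on the host appear inside running containers
DOCKER_VOLUMES          = CVMFS
DOCKER_VOLUME_DIR_CVMFS = /cvmfs:/cvmfs:shared
DOCKER_MOUNT_VOLUMES    = CVMFS
```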

11 It’s complicated
With the latest Docker for RHEL7:
- the default storage driver is OverlayFS: on a standard XFS filesystem you get lots of kernel errors & the host eventually dies (future Docker releases will refuse to run in this situation)
- with the device-mapper storage driver, bugs in the RHEL 7.3 kernel (it's old!) result in occasional problems deleting containers
We're using OverlayFS with an XFS partition formatted correctly ("ftype=1"): no problems so far.
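The ftype=1 requirement can be verified by parsing `xfs_info` output before enabling OverlayFS; a minimal sketch (the sample output string is illustrative):

```python
import re

def xfs_ftype(xfs_info_output: str) -> int:
    """Extract the ftype value from `xfs_info` output.

    OverlayFS on XFS is only safe when the filesystem was created
    with ftype=1 (i.e. mkfs.xfs -n ftype=1).
    """
    match = re.search(r"ftype=(\d)", xfs_info_output)
    if match is None:
        raise ValueError("no ftype field found in xfs_info output")
    return int(match.group(1))

# Fragment of typical xfs_info output (illustrative):
sample = "naming   =version 2    bsize=4096   ascii-ci=0 ftype=1"
print(xfs_ftype(sample))  # 1: safe for OverlayFS
```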

12 Worker nodes
Using the Docker universe, the pilot jobs are isolated from the host, but what about the payload jobs?
[Diagram: worker node with a container running a pilot job, which in turn runs payload jobs]

13 Containers & unprivileged users
The Docker engine daemon runs as root: you need root access to run containers. Many tools have been developed to run containers on batch systems as unprivileged users:
- Shifter (NERSC)
- Singularity (LBL)
- udocker (INDIGO-DataCloud)
- bdocker (INDIGO-DataCloud, upcoming INDIGO-2 release)
WLCG has settled on Singularity, which is also very popular at US HPC sites.

14 Singularity
What it does:
- allows a user to run a process (as the same user) in a specified environment
- provides file isolation & process isolation
How does this compare to Docker? Docker has more features, including:
- more namespaces
- cgroups for resource monitoring & limiting (CPU, memory, swap, disk IO, ...)
- network isolation
- Linux capabilities
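In practice a pilot (or any unprivileged user) launches a payload in a specified environment along these lines; the image path and script name are hypothetical:

```
# Run a command inside a container image, as the invoking user (no root needed)
singularity exec /path/to/centos6.img ./payload.sh
```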

15 Singularity
Experiments can run Singularity containers themselves, e.g. the CMS model:
- payload jobs cannot see other processes on the host, or even processes from the pilot
- payload jobs cannot see any files from the pilot
[Diagram: worker node with a Docker container running the pilot job, which launches each payload in its own Singularity container]

16 Computing elements
My view (ARC CE, HTCondor CE):
- experiments use CEs to acquire & provision resources, e.g. ATLAS & CMS can request CPUs & memory as needed
- keep just a single queue per CE
- could specify the OS using an RTE in XRSL, e.g. (runtimeenvironment=ENV/OS/EL6) or (runtimeenvironment=ENV/OS/EL7)
DIRAC has a gLite WMS-style view of the Grid: we may need to set up dedicated CEs for EL7.
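A job description selecting an OS environment via an RTE might look like this; apart from the runtimeenvironment attribute taken from the slide, the other attributes are illustrative:

```
&(executable="job.sh")
 (runtimeenvironment="ENV/OS/EL7")
 (count=1)
 (memory=2000)
```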

17 Monitoring & traceability
Greater visibility into what each job is doing:
- including networking
- see what processes are running in each job (without relying on uids)
- resource usage metrics

18 Example: CMS jobs

19 Current status
~30% of the batch farm has been migrated to SL7. Have run jobs from all LHC experiments & other VOs (ENMR, ILC, SNO+, Pheno, LSST, ...).

20 Plans
Things in progress or planned (short/medium term):
- Ceph xrootd gateways on worker nodes, also Ceph xrootd proxies on worker nodes
- configure CEs to provide access to EL7 environments
- provide Singularity in EL7 environments, so CMS can migrate from glexec to Singularity at RAL
- decommission pool accounts on worker nodes: they serve no purpose at all; per-slot users are far simpler
- automated rolling reboots: use the etcd distributed key-value store for coordinating reboots, so all worker nodes will drain & reboot themselves when necessary while maintaining MoU commitments
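The reboot-coordination idea can be sketched as a fixed number of "drain slots" claimed by atomic compare-and-swap, which is the primitive etcd provides. This is an illustration only: the in-memory KVStore class below stands in for etcd, and all names (try_acquire_reboot_slot, max_draining) are hypothetical.

```python
class KVStore:
    """In-memory stand-in for etcd: atomic compare-and-swap on a key."""
    def __init__(self):
        self.data = {}

    def cas(self, key, expected, new):
        # etcd performs this atomically; here it is single-threaded by construction
        if self.data.get(key) == expected:
            self.data[key] = new
            return True
        return False

def try_acquire_reboot_slot(store, node, max_draining=2):
    """A node may start draining only while fewer than max_draining
    nodes hold a slot, protecting overall farm capacity."""
    for slot in range(max_draining):
        if store.cas(f"reboot/slot{slot}", None, node):
            return slot
    return None

def release_reboot_slot(store, node, slot):
    # Only the holder can release its slot
    store.cas(f"reboot/slot{slot}", node, None)

store = KVStore()
print(try_acquire_reboot_slot(store, "wn001"))  # 0
print(try_acquire_reboot_slot(store, "wn002"))  # 1
print(try_acquire_reboot_slot(store, "wn003"))  # None: capacity protected
```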

21 Further in the future
So now we have the ability to:
- run either SL6 or SL7 jobs from LHC experiments
- run jobs with other (Linux) OSs
What if:
- someone wants to run SLURM MPI jobs?
- someone wants to run Spark jobs?
- we need more hypervisors for the cloud?
Move away from dedicated HTCondor worker nodes to more flexible, generic nodes: if needed as an HTCondor worker node, a scheduler can run all the appropriate containers, including HTCondor, CVMFS, ...

22 Container cluster managers
Activities:
- using Mesos as a platform for multiple computing activities & running services
- using Kubernetes as an abstraction across multiple clouds
In both cases:
- nodes don't have any grid middleware installed, but can run grid worker nodes as needed
- CVMFS is difficult currently: containers usually have private mount namespaces, therefore CVMFS from one container is not visible anywhere else

23 Kubernetes
RCUK Cloud Working Group Pilot Project investigating:
- portability between on-prem resources & public clouds
- portability between multiple public clouds
What are we doing that's different to previous work in HEP? Previous work all involved different methods of provisioning VMs using cloud APIs:
- all the major public clouds have different APIs
- you get locked in to specific clouds; a lot of work to move to a different cloud
Instead we're using Kubernetes as an abstraction layer:
- only worry about the Kubernetes API
- public clouds generally provide instant Kubernetes clusters

24 Kubernetes
How Kubernetes is being used to run LHC jobs:
- create a pool of pilots which scales automatically depending on how much work is available ("vacuum" model)
- squids for CVMFS & Frontier
- created by a single command, e.g. kubectl create -f atlas.yaml
[Diagram: replication controllers for pilot & squid pods, a horizontal pod autoscaler, a custom controller, proxy renewal (cron), and a service with a stable VIP]
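The atlas.yaml referred to above is not shown in the slides; the fragment below is a hedged sketch of what a pilot pool plus autoscaler could look like (all names and the image are hypothetical, and the actual scaling on "how much work is available" would come from the custom controller rather than the standard CPU-based autoscaler):

```yaml
# Illustrative only: a pilot pool and an autoscaler bounding its size
apiVersion: v1
kind: ReplicationController
metadata:
  name: atlas-pilot
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: atlas-pilot
    spec:
      containers:
      - name: pilot
        image: example/atlas-pilot:latest   # hypothetical image
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: atlas-pilot
spec:
  scaleTargetRef:
    apiVersion: v1
    kind: ReplicationController
    name: atlas-pilot
  minReplicas: 1
  maxReplicas: 100
```

Everything would then be created with a single command, e.g. kubectl create -f atlas.yaml.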

25 Kubernetes
So far:
- have successfully run CMS jobs at RAL, Google, AWS & Azure
- have successfully run ATLAS & LHCb jobs at RAL
Ongoing work:
- a RAL Azure site is being set up within ATLAS
- Azure blob storage has been added to RAL Dynafed
- an FTS3 instance has been set up on Azure
Aim is to run ATLAS jobs on Azure, up to ~5000 concurrent cores, using Azure blob storage via Dynafed. Thanks to Google & Amazon for credits & to Microsoft for an Azure Research Award.

26 Other technology

27 Storing container images
Images are stored in a private Docker registry:
- no reliance on external services (including Amazon S3)
- using a Swift storage backend
- two services: the registry & an auth server (authentication, ACLs, ...)
[Diagram: Docker registry & auth server in front of Ceph gateways exposing the Swift API, backed by Ceph]
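A registry with a Swift backend and a token auth server is wired together in the registry's configuration file; this is a sketch only, with hypothetical endpoints and credentials:

```yaml
# Illustrative docker/distribution config fragment
version: 0.1
storage:
  swift:
    authurl: https://ceph-gw.example.com/auth/v1.0   # hypothetical Swift endpoint
    username: registry
    password: secret
    container: docker-registry
auth:
  token:
    realm: https://auth.example.com/token            # hypothetical auth server
    service: docker-registry
    issuer: registry-auth
    rootcertbundle: /etc/registry/auth.crt
```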

28 Logstash
Increasing usage of Filebeat has led to a proliferation of VMs running Logstash. Started running multiple Logstash instances in containers on 2 machines as a first trial at consolidation.
[Diagram: many standalone Logstash VMs consolidated into Logstash containers on two hosts, fed by Filebeat]

29 Load balancers
Using HAProxy & Keepalived as HA load balancers in front of some services:
- FTS for over a year
- site & top BDII since January
- Dynafed, OpenStack
Was particularly useful in hiding a recent Hyper-V incident from users. CERN is likely to put HAProxy in front of their FTS instances soon.
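The pattern above pairs an HAProxy configuration like the sketch below (hostnames and port are hypothetical) with Keepalived moving a service VIP between the two HAProxy hosts on failure:

```
# Illustrative haproxy.cfg fragment: round-robin across two FTS servers
frontend fts_front
    bind *:8446
    default_backend fts_back

backend fts_back
    balance roundrobin
    server fts01 fts01.example.com:8446 check
    server fts02 fts02.example.com:8446 check
```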

30 Monitoring infrastructure
Moving away from Ganglia to more modern & flexible tools:
- Telegraf (metrics collection)
- InfluxDB (time series database)
- Grafana (visualisation)
3 InfluxDB instances: general services & head nodes, Ceph (Echo), worker nodes. Currently have over 800 hosts sending metrics to InfluxDB.
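On each host, Telegraf needs only a small configuration to collect basic metrics and ship them to an InfluxDB instance; a sketch, with a hypothetical URL and database name:

```
# Illustrative telegraf.conf fragment
[[outputs.influxdb]]
  urls = ["http://influxdb.example.com:8086"]
  database = "telegraf"

[[inputs.cpu]]
[[inputs.mem]]
[[inputs.disk]]
```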

31 Summary
We're currently migrating to the HTCondor Docker universe:
- jobs no longer depend on the OS version or software installed on worker nodes
- gain a lot of flexibility
Also likely to provide Singularity, giving experiments the possibility to run containers within their pilot jobs.

