1
Container Orchestration - Simplifying Use of Public Clouds
Andrew Lahiff, Ian Collier
STFC Rutherford Appleton Laboratory
27th April 2017, HEPiX Spring 2017 Workshop
2
Overview
Introduction: clouds in HEP
Kubernetes: as an abstraction layer; on public clouds
Running LHC jobs
Tests with CMS, ATLAS, LHCb
Summary & future plans
It might be worth mentioning that the cloud providers stressed *very* strongly that we are not allowed to publicly discuss any performance measurements or performance-related comparisons between different clouds, so unfortunately there will be no performance plots or numbers in this presentation.
3
Clouds in HEP
Clouds have been used in HEP for many years
as a flexible infrastructure upon which traditional grid middleware can be run
as an alternative to traditional grid middleware
as a way of obtaining additional resources quickly, potentially in very large amounts, e.g. to meet a deadline
Generally this has always involved different methods for provisioning VMs on demand, which join an HTCondor pool or directly run an experiment's pilot
4
Clouds in HEP
Limitations with this approach: there is no portability
each cloud provider has a different API
a lot of effort is required to develop software to interact with each API
some investigations have made use of cloud-provider-specific services (sometimes many services)
this is a problem if you need or want to use resources from a different cloud provider
There is no portability between on-premises resources & public clouds, or between different public clouds
Is there another way?
Even when different clouds provide the "same" API there are differences, e.g. I remember it took a long time for all the bugs & quirks to be sorted out for condor managing VMs on OpenStack via EC2. Example: condor_annex, which is meant to simplify bursting a condor pool onto AWS, uses 10 AWS services (EBS, EC2, CloudWatch, SNS, VPC, S3, ASG, CloudFormation, IAM, Lambda). This is not something that can be immediately used on a different cloud.
5
Kubernetes
Kubernetes is an open-source container cluster manager
originally developed by Google, donated to the Cloud Native Computing Foundation
schedules & deploys containers onto a cluster of machines, e.g. ensure that a specified number of instances of an application are running
provides service discovery, configuration & secrets, ...
provides access to persistent storage
Terminology used in this talk: pod
smallest deployable unit of compute
consists of one or more containers that are always co-located, co-scheduled & run in a shared context
Pods share the same network interface, IPC namespace, can share volumes, ... (a minimal example follows below)
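To make the pod terminology concrete, here is a hypothetical sketch (placeholder names and images, not taken from the talk) of a two-container pod sharing an emptyDir volume, created with the official Kubernetes Python client; the two containers are always co-scheduled on the same node and share the pod's network namespace.

```python
# Hypothetical illustration only: a pod with two co-located containers sharing
# an emptyDir volume; names and images are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster

shared = client.V1Volume(name="shared", empty_dir=client.V1EmptyDirVolumeSource())

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="example-pod"),
    spec=client.V1PodSpec(
        volumes=[shared],
        containers=[
            client.V1Container(
                name="app", image="example/app:latest",
                volume_mounts=[client.V1VolumeMount(name="shared", mount_path="/data")]),
            client.V1Container(
                name="sidecar", image="example/sidecar:latest",
                volume_mounts=[client.V1VolumeMount(name="shared", mount_path="/data")]),
        ]))

client.CoreV1Api().create_namespaced_pod("default", pod)
```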
6
Why Kubernetes?
It can be run anywhere: on premises & multiple public clouds
Idea is to use Kubernetes as an abstraction layer
migrate to containerised applications managed by Kubernetes & use only the Kubernetes API
can then run out-of-the-box on any Kubernetes cluster
Avoid vendor lock-in as much as possible by not using any vendor-specific APIs or services
Can make use of (or contribute to!) developments from a much bigger community than HEP
other people using Kubernetes want to run on multiple public clouds or burst to public clouds
7
Why not Mesos?
DC/OS is available on public clouds
DC/OS is the Mesosphere Mesos "distribution"
DC/OS contains a subset of Kubernetes functionality but consists of over 20 services; in Kubernetes this is all within a couple of binaries
One (big) limitation: security
open-source DC/OS has no security enabled at all
communication between Mesos masters, agents & frameworks is all insecure
no built-in way to distribute secrets securely
not ideal for running jobs from third parties (i.e. LHC VOs)
"Plain" Mesos is also not a good choice for our use case: would need to deploy complex infrastructure on each cloud
DC/OS contains functionality for service discovery, providing virtual IPs (as well as DNS). But all of this is provided by many individual services (run by systemd), rather than being included within 1 or 2 binaries (like Kubernetes). The enterprise edition of DC/OS doesn't have these limitations, i.e. everything is secure & it's possible to securely distribute secrets. Also note that work is currently in progress on adding the required functionality to Mesos so that secrets can be made available in containers. There are no standard tools or scripts for deploying "plain" Mesos on different public clouds. Lots of other services need to be deployed to make it usable (Mesos on its own is just like having a Linux kernel and nothing else), so it's not exactly appropriate where portability is important.
8
Instead of this...
Each cloud has a different API
[Diagram: a single VM provisioner has to talk to each cloud through a different API: Nova Compute API / EC2, Compute Engine API, OpenNebula XML-RPC / EC2, EC2, Azure API]
9
...we have this
A single API for accessing multiple resources
On-premises resources & public clouds used in exactly the same way
[Diagram: a container provisioner talking to Kubernetes clusters on every resource through the single Kubernetes API]
10
Kubernetes on public clouds
Availability on the major cloud providers:

                Native support    Node auto-scaling
    Google      ✔                 ✔
    Azure       ✔                 ✗
    AWS         ✗                 ✗

There are a variety of CLI tools available
automate deployment of Kubernetes
enable production-quality Kubernetes clusters on AWS & other clouds to be deployed quickly & easily
some include node auto-scaling
Also third-party companies providing Kubernetes-as-a-Service on multiple public clouds
It's important to note that the containers run on VMs in (most) public clouds, which is why node auto-scaling is important. I would expect Azure to eventually provide node auto-scaling. Examples of Kubernetes-as-a-Service: StackPointCloud (which is what I tried) and the new KUBE2GO ("Run Kubernetes Anywhere. Instantly. Free.").
11
Running LHC jobs
In principle this is straightforward; the main requirements are
squids for CVMFS & Frontier
an auto-scaling pool of pilots/worker nodes with CVMFS
credentials for joining the HTCondor pool or pilot system
Squids
use a Replication Controller: ensures a specified number of squid instances are running
use a Service to provide access, e.g. via a DNS alias for a virtual IP which points at the squids (a sketch follows below)
The virtual IP address is provided by "kube-proxy", which is a daemon running on every node in a Kubernetes cluster. It configures iptables rules dynamically to redirect traffic to the appropriate pods. Without this there would be no easy way to find the IP addresses of the squids (which can of course change with time if one squid dies and is recreated). Kubernetes usually has its own DNS servers running by default (as pods in the cluster). For the squid example, these provide the DNS alias ("squid") for the virtual IP address.
[Diagram: pilots reach the squid pods via a service with a stable virtual IP, backed by a replication controller]
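A minimal sketch of the squid Replication Controller and Service just described, using the official Kubernetes Python client; the image name, port and replica count are assumptions for illustration, not the actual configuration used in these tests.

```python
# Sketch only: squid ReplicationController + Service; image, port and replica
# count are placeholders.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

squid_rc = client.V1ReplicationController(
    metadata=client.V1ObjectMeta(name="squid"),
    spec=client.V1ReplicationControllerSpec(
        replicas=2,                       # keep this many squid instances running
        selector={"app": "squid"},
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "squid"}),
            spec=client.V1PodSpec(containers=[client.V1Container(
                name="squid",
                image="example/frontier-squid:latest",              # placeholder image
                ports=[client.V1ContainerPort(container_port=3128)])]))))

# The Service gives a stable virtual IP and the DNS alias "squid" in front of
# whichever squid pods happen to be running at the time.
squid_svc = client.V1Service(
    metadata=client.V1ObjectMeta(name="squid"),
    spec=client.V1ServiceSpec(
        selector={"app": "squid"},
        ports=[client.V1ServicePort(port=3128, target_port=3128)]))

v1.create_namespaced_replication_controller("default", squid_rc)
v1.create_namespaced_service("default", squid_svc)
```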
12
Pool of workers
Need to have a pool of worker pods which
scale up if there's work
scale down when there's not as much work
Initially tried using existing Kubernetes functionality only: a replication controller + horizontal pod auto-scaler, which scales the number of pods to maintain a target CPU load (a sketch follows below)
[Plot: test on Azure with just under 700 cores; when CMS Monte-Carlo production jobs are submitted the number of pods increases, and when the jobs are killed the number of pods quickly decreases]
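The initial approach could be expressed roughly as below, using the autoscaling/v1 API via the Python client; the target name, replica limits and CPU threshold are assumptions rather than the values used in the Azure test.

```python
# Sketch only: CPU-based horizontal pod auto-scaler targeting a replication
# controller of pilot pods; names and numbers are assumptions.
from kubernetes import client, config

config.load_kube_config()
autoscaling = client.AutoscalingV1Api()

hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="pilot"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="v1", kind="ReplicationController", name="pilot"),
        min_replicas=1,
        max_replicas=170,                        # placeholder upper limit
        target_cpu_utilization_percentage=80))   # scale to maintain this CPU load

autoscaling.create_namespaced_horizontal_pod_autoscaler("default", hpa)
```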
13
Pool of workers
For CPU-intensive activities this works reasonably well, but there are issues
auto-scaling based on CPU usage is of course not perfect: CPU usage can vary during the life of a pilot (or job)
Kubernetes won't necessarily kill 'idle' pods when scaling down
quite likely that running jobs will be killed
this is a problem, e.g. ATLAS "lost heartbeat" errors
Alternative approach (a minimal sketch follows below)
wrote a custom controller pod which creates the worker pods
scales up when necessary using the "vacuum" model idea
downscaling occurs only by letting worker pods exit
The worker node / pilot pods are designed to have a limited lifetime. There are Kubernetes GitHub issues open about wanting more intelligent choices for which pods to kill when downscaling. The custom controller pod is essentially just a Python script running in a container which accesses the Kubernetes API to create more worker nodes as needed. Kubernetes automatically provides a token in each pod which can be used to authenticate with the API (you can configure what privileges these tokens will have). Basically, if worker pods are running for more than a specified duration (and don't fail) it assumes they are doing useful work, and more are created. If pods start exiting or failing then it backs off. It uses essentially the same algorithm as Vac/Vcycle.
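The loop below is a much-simplified sketch of such a vacuum-style controller, not the actual script used for these tests; the namespace, pool size and image name are assumptions, and the minimum-useful-runtime check is omitted for brevity.

```python
# Sketch only: a vacuum-style controller that tops up a pool of single-use
# pilot pods and backs off when pods fail; names, image and limits are
# placeholders, and the "minimum useful runtime" check is omitted.
import time
import uuid
from kubernetes import client, config

NAMESPACE = "default"     # assumption
MAX_WORKERS = 100         # assumption: upper limit on concurrent pilot pods

config.load_incluster_config()  # authenticate with the token Kubernetes provides in the pod
v1 = client.CoreV1Api()

def make_worker():
    """A single-use pilot pod: it exits on its own when the pilot finishes."""
    return client.V1Pod(
        metadata=client.V1ObjectMeta(
            name="pilot-%s" % uuid.uuid4().hex[:8],
            labels={"app": "pilot"}),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[client.V1Container(
                name="pilot", image="example/pilot:latest")]))   # placeholder image

while True:
    pods = v1.list_namespaced_pod(NAMESPACE, label_selector="app=pilot").items
    running = sum(1 for p in pods if p.status.phase == "Running")
    failed = sum(1 for p in pods if p.status.phase == "Failed")

    if failed:
        # Pods are failing or exiting early: back off, create nothing this cycle
        pass
    elif running < MAX_WORKERS:
        # Existing pods appear to be doing useful work, so top up the pool
        v1.create_namespaced_pod(NAMESPACE, make_worker())

    time.sleep(60)
```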
14
Credentials
Need to provide an X509 proxy to authenticate to the experiment's pilot framework
store a host certificate & key as a Kubernetes Secret
a cron runs a container which creates a short-lived proxy from the host certificate & updates a Secret with this proxy (a sketch of this step follows below)
Containers running jobs
the proxy is made available as a file
new containers will have the updated proxy
the proxy is also updated in already-running containers
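The renewal step might look roughly like this, a hedged sketch assuming the fresh proxy sits at a placeholder path and is stored in a Secret with assumed names; this is not the exact script used in the talk.

```python
# Sketch only: push a freshly created proxy into a Kubernetes Secret; the
# secret name, key and file path are placeholders.
import base64
from kubernetes import client, config

config.load_incluster_config()
v1 = client.CoreV1Api()

with open("/tmp/x509up", "rb") as f:              # short-lived proxy just created
    proxy_b64 = base64.b64encode(f.read()).decode()

# Patch the Secret in place: pods mounting it as a volume see the updated file,
# and newly created pods pick up the new proxy as well.
v1.patch_namespaced_secret(
    name="grid-proxy", namespace="default",
    body={"data": {"x509up": proxy_b64}})
```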
15
What about CVMFS? It's complicated
Currently Kubernetes only allows containers to have private mount namespaces
CVMFS mounted in one container is invisible to others on the same host
Installing CVMFS directly on the agents is not an option: this would defeat the whole purpose of using Kubernetes as an abstraction layer
Therefore
for proof-of-concept testing we're using privileged containers, both providing CVMFS & running jobs (see the sketch below)
once Kubernetes supports non-private volume mounts we will be able to separate CVMFS & jobs
There are a number of Kubernetes GitHub issues & pull requests open about providing non-private volume mounts. Since this gives containers the ability to modify the host's mount namespace, people are being very careful about it. Plus it needs to work on different Linux distributions, different container runtimes (Docker, rkt) etc., so unfortunately it's not trivial. Once this is resolved CVMFS can be provided by one container, and that CVMFS can then be mounted into other containers on the same host. In case anyone asks about the CERN CVMFS Docker volume plugin: this would need to be installed directly on the Kubernetes agents. Even though it can be run in a container (which is the recommended way to deploy it), it requires sharing the same mount namespace as the host, so it cannot (yet) be deployed by Kubernetes.
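For the proof-of-concept, a pilot pod that mounts CVMFS itself might be created along these lines; the image name is a placeholder and the pod spec is an assumption for illustration, not the actual one used.

```python
# Sketch only: a privileged pilot pod that mounts CVMFS inside the container
# (proof-of-concept workaround until non-private volume mounts exist); the
# placeholder image would run cvmfs2 and then the pilot.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="pilot-cvmfs", labels={"app": "pilot"}),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[client.V1Container(
            name="pilot",
            image="example/pilot-cvmfs:latest",
            security_context=client.V1SecurityContext(privileged=True))]))

v1.create_namespaced_pod("default", pod)
```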
16
Running LHC jobs on Kubernetes
Summary of what we're deploying on Kubernetes; it's quite straightforward:
proxy renewal pod (cron)
custom controller pod
pilot pods (+ CVMFS)
squid replication controller & squid pods, behind a service (stable VIP)
Note that the replication controller isn't something running in a pod; it's handled by the controller-manager (part of the Kubernetes master).

$ kubectl get pods
NAME READY STATUS RESTARTS AGE
atlaspilot-0jjx / Running m
atlaspilot-5p7z / Running m
squid sst / Running d
squid bvvn / Running d
vacuum-atlas t85rk 2/ Running s
17
Resources used for testing
On-premises resources & public clouds
RAL: Kubernetes cluster on bare metal
Microsoft Azure: initially used a CLI tool from Weaveworks, then changed to Azure Container Service
Google: Google Container Engine
Amazon: used StackPointCloud to provision Kubernetes clusters
For the Kubernetes cluster at RAL we used worker node hardware. Used "kubeadm" to deploy the cluster, which is very simple to do (install Docker + 3 RPMs on all machines, then run 1 command to create a master node and 1 command on each node). Used Weave for networking.
18
Initial tests with CMS analysis jobs
Successfully ran CRAB3 analysis jobs with a test glideinWMS instance at RAL
Performance (time per event, CPU efficiency, wall time, job success rate) was similar across the different resources
[Diagram: workflows are submitted to the CERN CRAB server, which creates jobs in the RAL glideinWMS HTCondor pool; startds and squids run on Kubernetes at RAL, Azure, Google & AWS]
CRAB: system for running CMS analysis
glideinWMS: pilot framework used by CMS
Reminder that we're not allowed to show any numbers or plots relating to performance since we didn't pay for the resources ourselves.
19
ATLAS & LHCb
Have run real ATLAS & LHCb jobs on Kubernetes at RAL, at a small scale
Using containers which
run a startd which joins an existing ATLAS HTCondor pool
run the LHCb DIRAC agent
[Plots: ATLAS successful jobs; cluster resource limits & usage]
20
ATLAS status & plans
Planning to do larger-scale tests with ATLAS on Azure soon
a RAL Azure site has been configured in AGIS & PanDA, and test jobs have run successfully
would like to test up to ~5000 running jobs
Step 1: use RAL storage (Echo) via GridFTP
Step 2: use Azure Blob storage via the RAL Dynafed; can use gfal commands to access Azure Blob storage (see the sketch below)
Why hasn't the testing started already? It's a bit of a long story. ATLAS are migrating to new pilot code, and we ended up with the new code and encountered some bugs affecting stage-out. This has been fixed by the pilot developer, but we are waiting for him to come back from vacation so it can go into production. UPDATE: Hammercloud test jobs have started running successfully.
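For step 2, access through Dynafed could look roughly like this with the gfal2 Python bindings (the programmatic equivalent of the gfal CLI commands mentioned above); the endpoint URL and path are placeholders, and this is a hedged sketch rather than a tested recipe.

```python
# Sketch only: listing Azure Blob storage exposed through a Dynafed endpoint
# using the gfal2 Python bindings; the URL below is a placeholder.
import gfal2

ctx = gfal2.creat_context()
url = "davs://dynafed.example.ac.uk/azure/atlas/"   # hypothetical Dynafed path

# List the directory and stat each entry, as gfal-ls / gfal-stat would do
for entry in ctx.listdir(url):
    info = ctx.stat(url + entry)
    print(entry, info.st_size)
```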
21
Federations
Provides the standard Kubernetes API, but applied across multiple Kubernetes clusters
new, but eventually should make it easy to overflow from on-premises resources to public clouds
A Kubernetes federation provides the same API as an individual Kubernetes cluster
[Diagram: a container provisioner talking to a Kubernetes federation, which spans multiple Kubernetes clusters]
The design document for Kubernetes federations includes having the ability to automatically overflow from on-premises resources to public clouds (but this hasn't been developed yet).
22
Summary Kubernetes is a promising way of running LHC workloads
enables portability between on-premises resources & public clouds
no longer need to do one thing for grid sites & different things for clouds
have successfully run ATLAS, CMS & LHCb jobs
Looking forward to larger-scale tests with real ATLAS jobs
23
Thank you. Questions?
Acknowledgements
Microsoft (Azure Research Award)
Google
Amazon
Alastair Dewhurst, Peter Love, Paul Nilsson (ATLAS)
Andrew McNab (LHCb)