1 Experience integrating a production private cloud in a Tier 1 Grid site
Ian Collier, Andrew Lahiff, George Ryall – STFC RAL Tier 1
ISGC 2015 – 20th March 2015

2 Virtualisation @ RAL Tier 1
– Context
– Scientific Computing Department Cloud
– Batch farm related virtualisation

3 Context
The RAL Tier 1 sits in the Science & Technology Facilities Council (STFC) Scientific Computing Department; its primary project is the WLCG Tier 1.
For many reasons we seek to make our resources more adaptable and more readily usable by other projects:
– Many department projects, for example those funded by Horizon 2020, will benefit from an elastic and scalable resource
– Self-service internal development systems
– New interfaces for WLCG VOs
– Investigating providing compute resources to other parts of STFC (Diamond Light Source, ISIS Neutron Source) for whom we already provide data services

4 Context
March 2014: secured funding for 30 well-specified hypervisors & 30 storage servers
– ~1000 cores
– ~4 TB RAM
– ~1 PB raw storage
In September we secured the first dedicated effort to turn earlier experiments into a service with a defined service level.
This all builds on some 2½ years of experiments.

5 Virtualisation & Cloud @ RAL
– Context
– Scientific Computing Department Cloud
– Dynamically provisioned worker nodes

6 History – SCD Cloud
Began as a small experiment 3 years ago
– Initially using StratusLab & old worker nodes
– Initially very quick and easy to get working
– But fragile, and upgrades and customisations were always harder
Work until now implemented by graduates on 6-month rotation
– Disruptive & variable progress
Worked well enough to prove its usefulness – something of an exercise in managing expectations.
Self-service VMs proved very popular.

7 SCD Cloud Present
Carried out a fresh technology evaluation
– Things have moved on since we began with StratusLab
– Chose OpenNebula with a Ceph backend
Now a service with a defined (if limited) service level for users across STFC
– Integrated into the existing Tier 1 configuration & monitoring frameworks
IaaS upon which we can offer PaaS
– One platform could ultimately be the Tier 1 itself
– Integrating cloud resources into Tier 1 grid work

8 SCD Cloud Setup
– OpenNebula
– Ceph for the image store and running images
– Collaborating with UGent on configuration tools for both OpenNebula and Ceph
– Active Directory (AD) as initial authentication: covers all STFC staff and many partners; will add others as use cases require
2 staff began work in September: one full time, one half time.
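For illustration, a minimal sketch of the kind of OpenNebula datastore template that registers a Ceph pool for images; the pool, host and user names here are assumptions, not RAL's actual configuration:

    # ceph-images.ds – illustrative values only
    NAME        = "ceph_images"
    DS_MAD      = ceph
    TM_MAD      = ceph
    DISK_TYPE   = RBD
    POOL_NAME   = one                # Ceph pool holding the RBD images (assumed name)
    CEPH_HOST   = "mon1 mon2 mon3"   # Ceph monitor hosts (placeholders)
    CEPH_USER   = libvirt            # cephx user that libvirt/qemu authenticates as
    CEPH_SECRET = "11111111-2222-3333-4444-555555555555"   # libvirt secret UUID (placeholder)
    BRIDGE_LIST = "hv01 hv02"        # hosts used to stage images into the pool

Such a template would be registered with "onedatastore create ceph-images.ds"; images uploaded to that datastore then live as RBD objects in the Ceph pool rather than as files on the front-end.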

9 SCD Cloud Ceph
– A separate project to identify suitable alternatives to Castor for our grid storage identified Ceph as being interesting
– Have been building up experience over the last 2 years
– Have a separate test bed (half a generation of standard disk servers, 1.7 PB raw) being tested for grid data storage
Ceph as cloud backend
– Image store, running images, and possibly a data service coupled to the cloud
– Performance tests show little difference compared to local storage for running VM images, and instantiation is much faster
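As a rough sketch of the backend preparation involved (the pool name and placement-group count are placeholders, not RAL's real values), the cloud image store is simply an ordinary RADOS pool that RBD images are written into:

    # create a pool for the cloud image store (pg count is illustrative)
    ceph osd pool create one 128
    # list the RBD images stored in the pool
    rbd ls -p one
    # overall cluster health and capacity
    ceph -s
    ceph df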

10 SCD Cloud
A summer (high-school) student developed a web interface
– Kerberos authentication
– Streamlined selection of images & VM templates
– VNC console to running machines

11 SCD Cloud
Being launched on a self-service basis to the entire Scientific Computing Department later today.
Still a development platform – capabilities will be changing rapidly.
Integrated into Tier 1 monitoring & exception handling
– No out-of-hours support
OpenNebula and Ceph managed by Quattor/Aquilon
– Result of collaboration with UGent
Both standalone VMs and VMs managed by Quattor/Aquilon are supported.
Early H2020-funded project: INDIGO-DataCloud

12 Cloud @ RAL
– Context
– Scientific Computing Department Cloud
– Batch farm related virtualisation

13 Bursting the batch system into the cloud
Last year we spoke about leveraging HTCondor power management features to dynamically burst batch work into the cloud.
Aims
– Integrate the cloud with the batch system
– First step: allow the batch system to expand into the cloud, avoiding additional third-party and/or complex services and leveraging existing functionality in HTCondor as much as possible
Proof-of-concept testing carried out with StratusLab
– Successfully ran ~11,000 jobs from the LHC VOs
We can now ensure our private cloud is always used
– The LHC VOs can be depended upon to provide work

14 Bursting the batch system into the cloud
Initial situation: partitioned resources
– Worker nodes (batch system)
– Hypervisors (cloud)
Likely to be a common situation at sites providing both batch & cloud resources.
Ideal situation: completely dynamic
– If the batch system is busy but the cloud is not, expand the batch system into the cloud
– If the cloud is busy but the batch system is not, expand the cloud and reduce the batch system's share of the resources
[Diagram: the boundary between the "cloud" and "batch" partitions moves in either direction.]

15 Based on the existing power management features of HTCondor
Virtual machine instantiation
– ClassAds for offline machines are sent to the collector when there are free resources in the cloud
– The negotiator can match idle jobs to the offline machines
– The HTCondor rooster daemon notices this match & triggers creation of VMs
Virtual machine lifetime
– Managed by HTCondor on the VM itself, configured to:
  only start jobs when a health-check script is successful;
  only start new jobs for a specified time period;
  shut down the machine after being idle for a specified period
– Virtual worker nodes are drained when free resources on the cloud start to fall below a specified threshold
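The central-manager side of this can be sketched roughly as below. The knob names are standard condor_rooster configuration, but the expressions, script path and custom attribute are assumptions for illustration, not RAL's actual settings; the offline ClassAds themselves would typically be pushed into the collector with condor_advertise UPDATE_STARTD_AD.

    # Central manager: enable condor_rooster and point it at a VM-creation script
    DAEMON_LIST = $(DAEMON_LIST) ROOSTER
    ROOSTER_INTERVAL = 300                        # how often to check for matched offline ads (seconds)
    # Only "wake" offline ads that represent virtual worker nodes (attribute name assumed)
    ROOSTER_UNHIBERNATE = (Offline =?= True) && (VirtualWorkerNode =?= True)
    # Replace the default wake-on-LAN command with a script that instantiates a VM in the cloud
    ROOSTER_WAKEUP_CMD = "/usr/local/bin/instantiate-virtual-wn"   # hypothetical script

    # Virtual worker node: limited lifetime and idle shutdown (values illustrative)
    STARTD_NOCLAIM_SHUTDOWN = 1800                # exit after 30 minutes with no claim
    START = ($(START)) && (time() - DaemonStartTime < 2*24*3600)   # stop accepting jobs after ~2 days
    # (a small wrapper or shutdown hook would then power off the idle VM – mechanism omitted here)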

16 [Architecture diagram: ARC/CREAM CEs (condor_schedd) feed a central manager (condor_collector, condor_negotiator); offline machine ClassAds let condor_rooster instantiate virtual worker nodes (condor_startd) alongside the physical worker nodes (condor_startd); virtual worker nodes are drained when needed.]

17 Expansion into the cloud
[Plot: cores in the cloud (idle, used in batch, used outside batch) against running & idle jobs over time, showing the batch system expanding into the cloud as idle jobs build up.]

18 Bursting the batch system into the cloud
Last year this was a short-term experiment with StratusLab; our cloud is now entering production status.
The ability to expand the batch farm into our cloud is being integrated into our production batch system.
The challenge is having a variable resource so closely bound to our batch service
– HTCondor makes it much easier – elegant support for dynamic resources
– But significant changes to monitoring:
  moved to the HTCondor health check – no Nagios on virtual WNs;
  this has in turn fed back into the monitoring of bare-metal WNs
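A health check of this kind is typically implemented as a startd cron job whose output gates the START expression. A minimal sketch, with the script path and attribute name assumed rather than taken from RAL's actual configuration:

    # Run a health-check script periodically on the (virtual) worker node
    STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) WNHEALTH
    STARTD_CRON_WNHEALTH_EXECUTABLE = /usr/local/bin/healthcheck_wn   # hypothetical script
    STARTD_CRON_WNHEALTH_PERIOD = 10m
    # The script prints ClassAd assignments, e.g.:  NODE_IS_HEALTHY = True
    # Only start jobs while the node reports itself healthy
    START = ($(START)) && (NODE_IS_HEALTHY =?= True)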

19 The Vacuum Model
The "vacuum" model is becoming popular in the UK
– An alternative to CE + batch system or clouds
– No centrally-submitted pilot job or requests for VMs
– VMs appear by "spontaneous production in the vacuum"
– VMs run the appropriate pilot framework to pull down jobs
– Discussed in Jeremy Coles' talk on Tuesday
Can we incorporate the vacuum model into our existing batch system?
– HTCondor has a "VM universe" for managing VMs

20 Vacuum Model & HTCondor
Makes use of some less commonly used features of HTCondor, including
– Job hooks, custom file transfer plugins, condor_chirp
Features
– Uses the same configuration file as Vac
– Images downloaded & cached on worker nodes
– Quarantining of disk images after VMs are shut down
– Accounting data sent directly to APEL
– Stuck VMs killed by a PeriodicRemove expression
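Roughly, these pieces map onto standard HTCondor job-hook and file-transfer-plugin configuration as sketched below; the hook keyword and paths are made-up names for illustration, not the actual RAL configuration. Jobs select the hooks via a +HookKeyword attribute, as in the submit sketch after the lifecycle slide below.

    # Worker-node config: starter job hooks that wrap the vm-universe job
    VACUUM_HOOK_PREPARE_JOB     = /usr/local/libexec/vac_prepare    # sparse CVMFS disk, contextualization ISO
    VACUUM_HOOK_UPDATE_JOB_INFO = /usr/local/libexec/vac_update     # record heartbeat reported via condor_chirp
    VACUUM_HOOK_JOB_EXIT        = /usr/local/libexec/vac_exit       # quarantine image, push accounting to APEL
    # Custom file-transfer plugin that downloads and caches VM disk images
    FILETRANSFER_PLUGINS = $(FILETRANSFER_PLUGINS), /usr/local/libexec/vac_image_plugin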

21 VM lifecycle in HTCondor
– Download disk image, or copy cached image to the job sandbox: custom transfer plugin
– Set up sparse disk for CVMFS cache, create contextualization iso, …: job prepare hook
– VM created: condor_vm-gahp
– Update time of last heartbeat from the VM: job update hook (condor_chirp)
– Copy disk image to quarantine area, add ShutdownCode from the VM to the job ClassAd: job exit hook (condor_chirp)
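Putting the lifecycle together, a hypothetical vm-universe submit description might look like the sketch below; the image URL, hook keyword and removal expression are assumptions, not the production values.

    # vacuum-vm.sub – illustrative only
    universe             = vm
    executable           = vacuum_vm               # in the vm universe this is just a label
    vm_type              = kvm
    vm_memory            = 4096
    vm_networking        = true
    vm_no_output_vm      = true                    # do not transfer the disk image back
    vm_disk              = root.qcow2:vda:w
    # fetch the image through the custom transfer plugin (URL scheme assumed)
    transfer_input_files = vacuum://images.example.org/sl6-vacuum/root.qcow2
    +HookKeyword         = "VACUUM"                # selects the job hooks configured on the startd
    # kill stuck VMs: here, anything still "running" after two days
    periodic_remove      = (JobStatus == 2) && ((time() - JobCurrentStartDate) > 2*24*3600)
    queue

condor_vm-gahp then manages the KVM domain for the lifetime of the job, with the prepare, update and exit hooks running around it.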

22 The Vacuum Model & HTCondor
Usage
– Successfully running regular SAM tests from the GridPP DIRAC instance
– Running ATLAS jobs

23 Cloud @ RAL
– Context
– Scientific Computing Department Cloud
– Batch farm related virtualisation

24 Summary
The private cloud has developed from a small experiment into a service with a defined service level
– With constrained effort – slower than we would have liked
– The prototype platform has been well used
– Ready to provide resources to funded projects on schedule
Demonstrated transparent expansion of the batch farm into the cloud, and the vacuum model.
The whole Tier 1 service is becoming more flexible.

25 Questions?

