Tier 1 Experience Provisioning Virtualized Worker Nodes on Demand
Ian Collier, Andrew Lahiff
UK Tier 1 Centre, RAL
ISGC 2014


Tier 1 Experience Provisioning Virtualized Worker Nodes on Demand
Ian Collier, Andrew Lahiff
UK Tier 1 Centre, RAL
ISGC 2014

2 Outline
STFC Scientific Computing Department cloud
What we did
– The problem
– The RAL batch system
– Provisioning worker nodes
– VM lifetime
– Images
– Fairshares
Testing
Monitoring

3 SCD cloud
Prototype cloud
– Gain practical experience
– Test potential use cases of a private cloud
Using StratusLab
– Based on OpenNebula
– Easy to set up: Quattor configuration provided
Using iSCSI & LVM based persistent disk storage
– Caches images
– Instantiation very fast: ~20-30 seconds or less
Hardware
– Cloud front end on a VM (hosted on Hyper-V)
– Persistent disk store on an 18 TB retired disk server
– Hypervisors: ~100 retired worker nodes (8 cores, 16 GB RAM each)

4 SCD cloud
Usage
– Available to all SCD staff on a self-service basis
– Very useful for testing & development
– Around 30 active users
However
– The cloud resources are largely idle
– Our batch system has the opposite problem: many idle jobs, not enough resources
Can we easily make use of the idle cloud resources in our batch system?

5 The problem
Initial situation: partitioned resources
– Worker nodes (batch system)
– Hypervisors (cloud)
Likely to be a common situation at sites providing both grid & cloud resources
Ideal situation: completely dynamic
– If the batch system is busy but the cloud is not, expand the batch system into the cloud
– If the cloud is busy but the batch system is not, expand the cloud & reduce the batch system's share of the resources
[Diagram: "cloud" & "batch" bars illustrating the partitioned & fully dynamic splits of the resources]

6 The problem
First step
– Still with partitioned resources: hypervisors & physical worker nodes
– Allow the batch system to expand into the cloud using virtual worker nodes
Main aspects to consider
– Provisioning virtual machines as they are required
– Adding new virtual worker nodes to the batch system
– Removing virtual worker nodes

7 RAL batch system
Last year RAL migrated from Torque + Maui to HTCondor
HTCondor has a somewhat different design to other batch systems
– Jobs state their requirements & preferences
– Machines (WNs) state their requirements & preferences
– Both are represented as ClassAds
  Lists of AttributeName = Value or AttributeName = Expression pairs
  Stored by the collector
– The negotiator matches jobs to machines
HTCondor therefore works very well with dynamic resources
– This is what it was originally designed for
– Adding & removing worker nodes is trivial
  New worker nodes advertise themselves to the collector
  Provided they have appropriate privileges to join the HTCondor pool, they can run jobs
– None of the complicated procedures that other batch systems may require
  No hard-wired list of WNs
  No pre-configuration of potential WNs
  No restarting of any services
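
As a minimal illustration of the two sides of a match, a machine ClassAd fragment (top) & a job ClassAd fragment (bottom); the hostname & values are invented, & only a few of the many real attributes are shown:

    Machine = "vm-1234.example.ac.uk"
    Cpus = 4
    Memory = 12288
    State = "Unclaimed"

    RequestCpus = 1
    RequestMemory = 2048
    Requirements = (Arch == "X86_64") && (OpSys == "LINUX")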

8 Provisioning worker nodes
We have chosen to
– Avoid running additional third-party and/or complex services
– Use existing functionality in the batch system as much as possible
HTCondor's existing power management features
– Entering a low power state
  The HIBERNATE expression defines when a slot is ready to enter a low power state
  When it is true for all slots, the machine will go into the specified low power state
– Machines in a low power state are "offline"
  The collector can keep offline ClassAds
– Returning from a low power state
  The condor_rooster daemon is responsible for waking up hibernating machines
  By default it sends a UDP Wake-on-LAN packet, but this can be replaced by a user-defined script
– When there are idle jobs
  The negotiator can match jobs to offline ClassAds
  condor_rooster notices the match & wakes up the machine
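
A sketch of the relevant configuration (the 30-minute idle threshold & check interval are assumptions, not the values used at RAL):

    # worker node: declare a slot ready to power off once it has been
    # idle for 30 minutes; "SHUTDOWN" powers the machine off entirely
    HIBERNATE_CHECK_INTERVAL = 300
    HIBERNATE = ifThenElse( (Activity == "Idle") && \
                            ((CurrentTime - EnteredCurrentActivity) > 1800), \
                            "SHUTDOWN", "NONE" )

    # collector: persist offline ClassAds so the negotiator can still
    # match idle jobs against machines that have powered down
    COLLECTOR_PERSISTENT_AD_LOG = /var/lib/condor/persistent_ad_log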

9 Provisioning worker nodes
Advertise appropriate offline ClassAd(s) to the collector
– The hostname used is a random string
– In our use case these represent types of VMs, rather than specific machines
  E.g. for VO-specific VMs, have an offline ClassAd for each type of VM
condor_rooster
– Enable this daemon
– Configure it to run an appropriate command to instantiate a VM
VM instantiation
– Contextualisation using CloudInit
– HTCondor pool password inserted into the VM
– Volatile disks created on the hypervisor's local disk for /tmp, the job scratch area & the CVMFS cache
When there are idle jobs
– The negotiator can match jobs to an offline ClassAd
  Configured so that online machines are preferred to offline ones
– condor_rooster notices the match & instantiates a VM
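
A sketch of the central manager side (the script path is a hypothetical placeholder for a site-provided VM instantiation command, & the rank expression is one possible way of preferring online machines):

    # run condor_rooster, replacing its default Wake-on-LAN command
    # with a script that instantiates a VM of the matched type
    DAEMON_LIST = $(DAEMON_LIST) ROOSTER
    ROOSTER_INTERVAL = 60
    ROOSTER_WAKEUP_CMD = /usr/local/bin/instantiate-vm
    # limit how many VMs can be "woken" (instantiated) per cycle
    ROOSTER_MAX_UNHIBERNATE = 10

    # negotiator: rank online machines above offline ClassAds
    NEGOTIATOR_PRE_JOB_RANK = (Offline =!= True)

The offline ad itself can be published with condor_advertise, e.g. condor_advertise UPDATE_STARTD_AD vm-type.ad, where vm-type.ad contains a machine ClassAd like the fragment shown earlier plus Offline = True.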

10 Provisioning worker nodes
Batch system
[Diagram: the central manager runs condor_collector & condor_negotiator, the ARC/CREAM CEs run condor_schedd, & the worker nodes run condor_startd]

11 Provisioning worker nodes
Batch system & cloud resources
[Diagram: as above, with the StratusLab cloud shown alongside the physical worker nodes]

12 Provisioning worker nodes
Using cloud resources
[Diagram: condor_rooster on the central manager instantiates virtual worker nodes, running condor_startd, in the StratusLab cloud; they join the pool alongside the physical worker nodes]

13 VM lifetime
Using short-lived VMs
– Only accept jobs for a limited time period before shutting down
HTCondor on the worker node controls everything
– START expression
  New jobs are allowed to start only for a limited time period after the VM was instantiated
  New jobs are allowed to start only if the VM is healthy
– HIBERNATE expression
  The VM is shut down after the machine has been idle for too long
Very simple
– No external systems are needed to determine whether VMs are idle
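
A sketch of such a START expression (the 24-hour window is an assumption, & NODE_IS_HEALTHY is a hypothetical attribute published by the health-check described below):

    # accept new jobs only during the first 24 hours after the startd
    # began running, & only while the node reports itself healthy
    START = ( (CurrentTime - DaemonStartTime) < (24 * 3600) ) && \
            ( NODE_IS_HEALTHY =?= True )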

14 VM lifetime
Benefits of short-lived VMs
– Rolling updates are easy, e.g. for kernel errata
Problems
– Inefficiencies due to constantly draining worker nodes
– Could be mitigated by
  Using single-core VMs rather than multi-core
  Running short jobs, where possible, while long-running jobs finish
Alternatively, could use long-running VMs
– But giving resources back to other cloud users then takes longer

15 Health-check
Worker nodes are being created dynamically
– Need to ensure we're not just creating black holes
Using the same health-check script as on our physical HTCondor worker nodes
– Runs as an HTCondor startd cron
  Puts information into the machine ClassAds as a result of executing a custom script
– Current checks: CVMFS, read-only file system, space left on the job scratch area partition, swap usage
– Prevents jobs from starting if there's a problem
  E.g. if the atlas.cern.ch CVMFS repository is broken, only new ATLAS jobs are prevented from starting
Broken virtual WNs will remain idle & eventually die
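
A sketch of the startd cron configuration (the script path, period & attribute names are assumptions):

    # run a health-check script every 10 minutes; its stdout is merged
    # into the slot ClassAds as attribute assignments, one per line,
    # e.g. NODE_IS_HEALTHY = True
    STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) HEALTHCHECK
    STARTD_CRON_HEALTHCHECK_EXECUTABLE = /usr/local/bin/healthcheck-wn
    STARTD_CRON_HEALTHCHECK_PERIOD = 10m
    STARTD_CRON_HEALTHCHECK_MODE = periodic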

16 Images
EMI-2 SL6 worker node image, the same as our physical production worker nodes
– Only images created by RAL Tier-1 sysadmins are used, & are therefore trusted
– Experiments/users can't provide images
  In the future this could change
Privileges compared to physical worker nodes
– Generally the same, e.g. read/write permissions for the storage system (CASTOR)
– Except that the HTCondor pool password is not built in to the images; it is inserted during instantiation

17 Fairshares
Opportunistic use of the cloud
– We almost always have idle jobs in the batch system, which would simply take over the cloud
– A fixed quota can be specified, but this isn't really enough
– Unfortunately clouds don't support fairshares
Currently using a simple method
– Choose cores & memory per VM for the cloud worker nodes such that some resources are always left free on each hypervisor
  E.g. with 8-core 16 GB hypervisors, use 4-core 12 GB VMs
– A simple starting point, but not ideal
  E.g. what if other users want VMs with large numbers of CPUs or lots of memory?

18 Testing
Testing in production
– Initial test with the production HTCondor pool
– Ran around 11,000 real jobs, including jobs from all LHC VOs
– Started with 4-core 12 GB VMs, then changed to 3-core 9 GB VMs
[Plot: virtual worker node activity over time, annotated where hypervisors were enabled, where condor_rooster was disabled & where the switch to 3-core VMs was made]

19 Testing
Load on persistent disk storage
– Only /tmp, the job scratch area & the CVMFS cache are on disk local to the hypervisor; everything else is on the persistent disk server
– Expected this to be the main source of scaling problems
– In small-scale testing there wasn't much load

20 Monitoring
Physical worker node monitoring at the RAL Tier-1
– Pakiti
  Monitors the status of nodes with respect to patches
– Logging
  Log files (e.g. /var/log/messages) sent to central loggers
– Nagios
  28 checks, including: CVMFS repositories, space left on different partitions, swap, load, dmesg, Linux NTP drift, temperature, …

21 Monitoring
Issues with dynamic resources
– Nagios is designed for static resources
  It can't handle machines coming & going
  We probably don't need Nagios at all for virtual worker nodes, provided jobs won't start if the WN is broken & broken WNs eventually shut down
– Maintaining a history of VMs
  What happens if a VM is created, causes lots of jobs to fail & finally shuts down?
  What happens if there is a security incident involving a short-lived virtual WN?
– Effect on infrastructure
  What if the batch system encountered scaling problems overnight when lots of new worker nodes were added?

22 Summary
Developed a method for including dynamically-provisioned worker nodes in the RAL Tier-1 batch system
– Very simple
– Relies entirely on existing HTCondor functionality
– Successfully tested
Next steps
– The StratusLab cloud will be decommissioned; setting up a production OpenNebula cloud
  900 vCPUs, 3.5 TB RAM, ~1 PB storage (Ceph)
– Move dynamically-provisioned worker nodes into production
– Further development/testing required
  Improvements in handling fairshares between batch system & cloud resources
  Making the boundary between batch system & cloud completely elastic

23 Questions? Comments?