Experience integrating a production private cloud in a Tier 1 Grid site
Ian Collier, Andrew Lahiff, George Ryall
STFC RAL Tier 1
ISGC 2015, 20th March 2015

RAL Tier 1 Context
Scientific Computing Department Cloud
Batch farm related virtualisation

Context
The RAL Tier 1 sits in the Science & Technology Facilities Council (STFC) Scientific Computing Department; its primary project is the WLCG Tier 1. For many reasons we seek to make our resources more adaptable and more readily usable by other projects:
– Many departmental projects, for example those funded by Horizon 2020, will benefit from an elastic, scalable resource
– Self-service internal development systems
– New interfaces for WLCG VOs
– Investigating providing compute resources to other parts of STFC (Diamond Light Source, ISIS Neutron Source), for whom we already provide data services

Context
In March 2014 we secured funding for 30 well-specified hypervisors & 30 storage servers:
– ~1000 cores
– ~4 TB RAM
– ~1 PB raw storage
In September we secured the first dedicated effort to turn the earlier experiments into a service with a defined service level. This all builds on some 2½ years of experimentation.

Virtualisation & RAL Context
Scientific Computing Department Cloud
Dynamically provisioned worker nodes

History – SCD Cloud
Began as a small experiment 3 years ago:
– Initially using StratusLab & old worker nodes
– Very quick and easy to get working at first
– But fragile, and upgrades and customisations were always harder
Work until now was carried out by graduates on a 6-month rotation:
– Disruptive & variable progress
It worked well enough to prove its usefulness, and was something of an exercise in managing expectations. Self-service VMs proved very popular.

SCD Cloud Present
Carried out a fresh technology evaluation:
– Things have moved on since we began with StratusLab
– Chose OpenNebula with a Ceph backend
Now a service with a defined (if limited) service level for users across STFC:
– Integrated into the existing Tier 1 configuration & monitoring frameworks
IaaS upon which we can offer PaaS:
– One platform could ultimately be the Tier 1 itself
– Integrating cloud resources into Tier 1 grid work

SCD Cloud Setup
– OpenNebula
– Ceph for the image store and running images
– Collaborating with UGent on configuration tools for both OpenNebula and Ceph
– Active Directory (AD) as initial authentication: covers all STFC staff and many partners; will add others as use cases require
2 staff began work in September, one full time, one half time.
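To make the OpenNebula/Ceph coupling concrete, below is a minimal sketch of the kind of OpenNebula images datastore template such a setup implies, following the standard OpenNebula Ceph driver attributes. The pool name, monitor hosts, Ceph user, secret UUID and bridge host are illustrative placeholders, not RAL's actual configuration.

    # Illustrative OpenNebula datastore template for a Ceph-backed image store
    NAME        = "ceph_images"
    DS_MAD      = ceph
    TM_MAD      = ceph
    DISK_TYPE   = RBD
    # RBD pool holding the images (placeholder)
    POOL_NAME   = one
    # Ceph monitor hosts and libvirt secret (placeholders)
    CEPH_HOST   = "mon1 mon2 mon3"
    CEPH_USER   = libvirt
    CEPH_SECRET = "11111111-2222-3333-4444-555555555555"
    # Host(s) used to stage images into the pool (placeholder)
    BRIDGE_LIST = "hypervisor01"

A template like this would be registered with onedatastore create; pairing it with a system datastore that also uses TM_MAD = ceph keeps running VM disks in the same Ceph cluster, which is what makes instantiation faster than copying images to local hypervisor disk.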

SCD Cloud – Ceph
– A separate project to identify suitable alternatives to CASTOR for our grid storage identified Ceph as being interesting
– Have been building up experience over the last 2 years
– Have a separate test bed (half a generation of standard disk servers, 1.7 PB raw) being tested for grid data storage
Ceph as cloud backend:
– Image store, running images, and possibly a data service coupled to the cloud
– Performance tests show little difference compared to local storage for running VM images, & instantiation is much faster

SCD Cloud
A summer (high school) student developed a web interface:
– Kerberos authentication
– Streamlined selection of images & VM templates
– VNC console to running machines

SCD Cloud
Being launched on a self-service basis to the entire Scientific Computing Department later today. Still a development platform – capabilities will be changing rapidly.
– Integrated into Tier 1 monitoring & exception handling, but NO out-of-hours support
– OpenNebula and Ceph managed by Quattor/Aquilon, a result of the collaboration with UGent
– Supports both standalone VMs and VMs managed by Quattor/Aquilon
– Early H2020-funded project: INDIGO-DataCloud

RAL Context
Scientific Computing Department Cloud
Batch farm related virtualisation

Bursting the batch system into the cloud
Last year we spoke about leveraging HTCondor power management features to dynamically burst batch work into the cloud.
Aims:
– Integrate the cloud with the batch system
– First step: allow the batch system to expand into the cloud
– Avoid running additional third-party and/or complex services
– Leverage existing functionality in HTCondor as much as possible
Proof-of-concept testing was carried out with StratusLab:
– Successfully ran ~11000 jobs from the LHC VOs
We can now ensure our private cloud is always used – the LHC VOs can be depended upon to provide work.

Bursting the batch system into the cloud
Initial situation: partitioned resources
– Worker nodes (batch system)
– Hypervisors (cloud)
This is likely to be a common situation at sites providing both batch & cloud resources.
Ideal situation: completely dynamic
– If the batch system is busy but the cloud is not: expand the batch system into the cloud
– If the cloud is busy but the batch system is not: expand the size of the cloud, reduce the amount of batch system resources
[Diagram: partitioned cloud/batch resources versus dynamic sharing between cloud and batch]

Based on the existing power management features of HTCondor.
Virtual machine instantiation:
– ClassAds for offline machines are sent to the collector when there are free resources in the cloud
– The negotiator can match idle jobs to the offline machines
– The HTCondor rooster daemon notices this match & triggers creation of VMs
Virtual machine lifetime – managed by HTCondor on the VM itself, configured to:
– Only start jobs when a health-check script is successful
– Only start new jobs for a specified time period
– Shut down the machine after being idle for a specified period
Virtual worker nodes are drained when free resources on the cloud start to fall below a specified threshold.
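As an illustration of the mechanism described above, here is a minimal, hypothetical HTCondor configuration sketch: the rooster settings belong on the central manager and the startd policy in the virtual worker node image. The script paths, the NODE_IS_HEALTHY attribute name and the time limits are assumptions for illustration, not RAL's production values.

    # --- Central manager: let condor_rooster "wake" offline machine ads ---
    # Offline ads for potential virtual worker nodes are injected into the
    # collector (e.g. with condor_advertise) while the cloud has free capacity.
    DAEMON_LIST = $(DAEMON_LIST), ROOSTER
    ROOSTER_INTERVAL = 300
    # Wake an offline slot only once the negotiator has matched a job to it
    ROOSTER_UNHIBERNATE = Offline && Unhibernate
    # Replace the default wake-on-LAN command with a site script that
    # instantiates a virtual worker node in the cloud (script name hypothetical)
    ROOSTER_WAKEUP_CMD = /usr/local/bin/instantiate-vwn
    ROOSTER_MAX_UNHIBERNATE = 10

    # --- Virtual worker node: startd policy ---
    # Periodic health check publishing, e.g., NODE_IS_HEALTHY = True
    STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) HEALTH
    STARTD_CRON_HEALTH_EXECUTABLE = /usr/local/bin/healthcheck
    STARTD_CRON_HEALTH_PERIOD = 5m
    # Start jobs only while healthy and only for the first 24 hours of life
    START = ($(START)) && (NODE_IS_HEALTHY =?= True) && ((time() - DaemonStartTime) < 86400)
    # Exit (and let the VM shut itself down) after 30 idle minutes
    STARTD_NOCLAIM_SHUTDOWN = 1800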

[Architecture diagram: central manager running condor_collector, condor_negotiator & condor_rooster; ARC/CREAM CEs running condor_schedd; physical worker nodes and virtual worker nodes running condor_startd; offline machine ClassAds advertise potential VMs, and virtual worker nodes are drained to return resources to the cloud]

Expansion into the cloud
[Plot: cores in the cloud over time (idle, used in batch, used not in batch) alongside running & idle jobs]

Bursting the batch system into the cloud
Last year this was a short-term experiment with StratusLab. Our cloud is now entering production status, and the ability to expand the batch farm into our cloud is being integrated into our production batch system.
The challenge is having a variable resource so closely bound to our batch service:
– HTCondor makes it much easier – elegant support for dynamic resources
– But it requires significant changes to monitoring
– Moved to the condor health check – no Nagios on virtual WNs
– This has in turn fed back into the monitoring of bare-metal WNs

The Vacuum Model
The “vacuum” model is becoming popular in the UK:
– An alternative to CE + batch system or clouds
– No centrally-submitted pilot job or requests for VMs
– VMs appear by “spontaneous production in the vacuum”
– VMs run the appropriate pilot framework to pull down jobs
– Discussed in Jeremy Coles' talk on Tuesday
Can we incorporate the vacuum model into our existing batch system?
– HTCondor has a “VM universe” for managing VMs

Vacuum Model & HTCondor
Makes use of some not-commonly-used features of HTCondor, including job hooks, custom transfer plugins and condor_chirp.
Features:
– Uses the same configuration file as Vac
– Images downloaded & cached on worker nodes
– Quarantining of disk images after VMs are shut down
– Accounting data sent directly to APEL
– Stuck VMs killed by a PeriodicRemove expression
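As a rough illustration of how such features are wired together, the sketch below shows hypothetical startd job-hook and file-transfer-plugin configuration. The VACUUM keyword, the hook paths and the plugin name are placeholders, not the actual RAL implementation.

    # Hooks invoked by the starter around each vacuum-style VM job
    STARTD_JOB_HOOK_KEYWORD = VACUUM
    # Prepare hook: set up the sparse disk for the CVMFS cache, build the
    # contextualization ISO, etc., before condor_vm-gahp creates the VM
    VACUUM_HOOK_PREPARE_JOB = /usr/local/libexec/vacuum/prepare_job
    # Update hook: called periodically while the VM is running
    VACUUM_HOOK_UPDATE_JOB_INFO = /usr/local/libexec/vacuum/update_job_info
    # Exit hook: quarantine the disk image and add the VM's ShutdownCode
    # to the job ClassAd
    VACUUM_HOOK_JOB_EXIT = /usr/local/libexec/vacuum/job_exit

    # Custom transfer plugin that downloads, or copies from the local cache,
    # the VM disk image into the job sandbox
    FILETRANSFER_PLUGINS = $(FILETRANSFER_PLUGINS), /usr/local/libexec/vacuum/image_plugin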

VM lifecycle in HTCondor
– Transfer plugin: download the disk image, or copy a cached image, to the job sandbox
– Job prepare hook: set up a sparse disk for the CVMFS cache, create the contextualization iso, ...
– condor_vm-gahp: VM created
– Job update hook (condor_chirp): update the time of the last heartbeat from the VM
– Job exit hook (condor_chirp): copy the disk image to the quarantine area, add the ShutdownCode from the VM to the job ClassAd
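To show how the heartbeat and the "stuck VM" protection described above could fit together, here is a hypothetical sketch: the update hook records a heartbeat attribute on the job with condor_chirp, and a schedd-side expression removes VM-universe jobs whose heartbeat has gone stale. The attribute name VacuumHeartbeatTime and the one-hour threshold are assumptions.

    # The update hook writes a heartbeat timestamp into the job ClassAd,
    # e.g. with:  condor_chirp set_job_attr VacuumHeartbeatTime `date +%s`
    # On the schedd, remove VM-universe jobs (JobUniverse == 13) whose
    # heartbeat is more than an hour old
    SYSTEM_PERIODIC_REMOVE = (JobUniverse == 13) && \
        (VacuumHeartbeatTime =!= UNDEFINED) && \
        ((time() - VacuumHeartbeatTime) > 3600)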

The Vacuum Model & HTCondor
Usage:
– Successfully running regular SAM tests from the GridPP DIRAC instance
– Running ATLAS jobs

RAL Context
Scientific Computing Department Cloud
Batch farm related virtualisation

Summary
The private cloud has developed from a small experiment into a service with a defined service level:
– With constrained effort – slower than we would have liked
– The prototype platform has been well used
– Ready to provide resources to funded projects on schedule
We have demonstrated transparent expansion of the batch farm into the cloud, and the vacuum model. The whole Tier 1 service is becoming more flexible.

Questions?