Cloud services at RAL, an Update
26th March 2015, Spring HEPiX, Oxford
George Ryall, Frazer Barnsley, Ian Collier, Alex Dibbo, Andrew Lahiff
V2.1

Overview
Background
Our current set up
Self-service test & development VMs
Traceability, security, and logging
Links with Quattor (Aquilon)
Expansion of the batch farm into unused capacity
What next

Background
Began as a small experiment 3 years ago
– Initially using StratusLab & old worker nodes
– Initially very quick and easy to get working
– But fragile, and upgrades and customisations always harder
Work until last spring was implemented by graduates on 6-month rotations
– Disruptive & variable progress
Worked well enough to prove usefulness
Self-service VMs proved very popular, though something of an exercise in managing expectations

Background (2)
In March 2014 we secured funding for our current hardware and I became involved, setting up the Ceph cluster.
A fresh technology evaluation led us to move to OpenNebula.
In September we secured the first dedicated effort to build on the previous 2½ years of experiments and produce a service with a defined service level: two staff, me full time and another half time.

Where we are now
Just launched with a defined, if limited, service level for users across STFC.
Integrated into the existing Tier 1 configuration & monitoring frameworks (yet to establish cloud-specific monitoring).
Now have an extra member of staff working on the project, bringing us close to two full-time equivalents.

Our current set up
OpenNebula-based cloud with a Ceph storage backend.
28 hypervisors providing 892 cores and 3.4TB of RAM.
750TB of raw storage in the supporting Ceph cluster (as seen in Alastair Dewhurst's presentation yesterday; performance testing described in Alex Dibbo's presentation this afternoon).
All connected together at 10Gb/s.
Web front end and head node on VMs in our Hyper-V production virtualisation infrastructure.
Three-node MariaDB/Galera cluster for the database (again on Hyper-V).

Self-service test & development VMs
First use case to be exposed to users, in a pre-production way.
Provides members of the Scientific Computing Department (~160 people) with access to VMs on demand for development work.
Quickly provides VMs to speed up the development cycle of various services and offers a testing environment for our software developers (<1 minute to a usable machine).
Clear terms of service and a defined level of service that currently falls short of production.

A simple web front end
To provide easy access to these VMs we have developed a simple web front end running on a VM.
This talks to the OpenNebula head node through its XML-RPC interface (a sketch of such a call is shown below).
It provides a simpler, more customised interface for our users than is available through OpenNebula's Sunstone interface.
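As a rough illustration of the kind of XML-RPC call the front end makes (this is not the actual RAL front-end code), the sketch below lists the authenticated user's VMs with Python's standard xmlrpc.client. The endpoint, credentials and session handling are placeholders.

```python
import xml.etree.ElementTree as ET
import xmlrpc.client

# Placeholder endpoint and credentials; the real head node name and the
# account the front end authenticates with are not given in the talk.
ONE_ENDPOINT = "http://one-headnode.example.ac.uk:2633/RPC2"
SESSION = "webfrontend:password"

server = xmlrpc.client.ServerProxy(ONE_ENDPOINT)

# one.vmpool.info(session, filter, start, end, state)
# filter -3 restricts the listing to VMs owned by the authenticated user.
response = server.one.vmpool.info(SESSION, -3, -1, -1, -1)
if not response[0]:
    raise RuntimeError(response[1])

# The second element of the response is an XML document describing the VMs.
for vm in ET.fromstring(response[1]).findall("VM"):
    print(vm.findtext("ID"), vm.findtext("NAME"), vm.findtext("STATE"))
```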

The web front end from a user's perspective

The user logs in with their organisation-wide credentials (implemented using Kerberos).

The web front end from a user's perspective
The user is presented with a list of their current VMs, a button to launch more, and an option to view historical information.

The web front end from a user's perspective
The user clicks "Create Machine" (because they're lazy, they use our auto-generate name button).

The web front end from a user's perspective
The user is presented with a list of possible machine types to launch that is relevant to them.
This is accomplished using OpenNebula groups and Active Directory user properties (see the sketch below).
CPU and memory are currently pre-set for each type; we can expand this later by request.
We could offer a choice, but we suspect users, being users, would just select the maximum available with little thought.
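One plausible way to derive that per-user list, sketched below under the assumption that each machine type is an OpenNebula template shared with the relevant group: listing templates with the "user and groups" filter means Active Directory driven group membership controls what is offered. The endpoint and session string are placeholders.

```python
import xml.etree.ElementTree as ET
import xmlrpc.client

ONE_ENDPOINT = "http://one-headnode.example.ac.uk:2633/RPC2"  # placeholder


def machine_types_for(session: str) -> list[tuple[str, str]]:
    """Return (id, name) pairs of the VM templates a user may instantiate.

    Filter -1 lists templates owned by the user or shared with any of
    their OpenNebula groups, so group membership (populated from Active
    Directory in the setup described) decides what appears in the menu.
    """
    server = xmlrpc.client.ServerProxy(ONE_ENDPOINT)
    # one.templatepool.info(session, filter, start, end)
    response = server.one.templatepool.info(session, -1, -1, -1)
    if not response[0]:
        raise RuntimeError(response[1])
    root = ET.fromstring(response[1])
    return [(t.findtext("ID"), t.findtext("NAME"))
            for t in root.findall("VMTEMPLATE")]
```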

The web front end from a user's perspective
The VM is listed as pending for about 20 seconds while OpenNebula deploys it on a hypervisor.

The web front end from a user's perspective
Once booted, the user can log in with their credentials, or SSH in with those same credentials.

The web front end from a user's perspective
Once the user is done, they click the delete button and, from their perspective, it goes away…

Traceability
…Actually, for traceability reasons (as seen in Ian Collier's Tuesday afternoon presentation), we keep snapshots of the images for a short period of time.
This allows us to investigate potential user abuse of short-lived VMs, and is also useful for debugging other issues.
At VM instantiation, an OpenNebula hook creates a deferred snapshot to be executed when the machine reaches the SHUTDOWN state.
A cron job runs daily, checks the type and age of all images, and deletes the relevant ones (a sketch of such a clean-up is shown below).
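A minimal sketch of what that daily clean-up could look like, assuming the deferred snapshots carry a recognisable name prefix and a fixed retention period; neither detail is stated in the talk, and the endpoint and credentials are placeholders.

```python
#!/usr/bin/env python3
"""Hypothetical daily clean-up of expired traceability snapshots."""
import time
import xml.etree.ElementTree as ET
import xmlrpc.client

ONE_ENDPOINT = "http://one-headnode.example.ac.uk:2633/RPC2"  # placeholder
SESSION = "oneadmin:password"                                  # placeholder
SNAPSHOT_PREFIX = "trace-snap-"                                # assumed naming scheme
RETENTION_DAYS = 7                                             # assumed retention period

server = xmlrpc.client.ServerProxy(ONE_ENDPOINT)

# one.imagepool.info(session, filter, start, end); filter -2 = all images
response = server.one.imagepool.info(SESSION, -2, -1, -1)
if not response[0]:
    raise SystemExit(response[1])

cutoff = time.time() - RETENTION_DAYS * 86400
for image in ET.fromstring(response[1]).findall("IMAGE"):
    name = image.findtext("NAME", "")
    regtime = int(image.findtext("REGTIME", "0"))
    # Only touch images that look like traceability snapshots and are old enough.
    if name.startswith(SNAPSHOT_PREFIX) and regtime < cutoff:
        # one.image.delete(session, image_id)
        server.one.image.delete(SESSION, int(image.findtext("ID")))
```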

Security Patching
Just like any other machine in our infrastructure, VMs need to have the latest security updates applied in a timely manner.
For Aquilon-managed machines, this will be done with the rest of our infrastructure.
The unmanaged images come with Yum auto-update and local Pakiti reporting turned on.
Our user policy expressly prohibits disabling this.
The next step is to monitor this.

Logging
We require all VMs running on our cloud to log.
They are configured, like the rest of our infrastructure, to use syslog to do this.
Again, disabling this is specifically prohibited in our terms of service.
Again, the next step is to implement monitoring of compliance with this.

Quattor (Aquilon) and Our Cloud
All of our infrastructure is configured using Quattor (as seen in this morning's presentation by James Adams); we are investigating the UGent-developed OpenNebula Quattor component and using the UGent Ceph component.
We build updated managed and unmanaged images for users, using Quattor to install them first and then stripping them back for the unmanaged images.
Managed VMs are available, but we re-use hostnames: rather than dynamically creating the hosts when managed VMs are launched, we use a hook when they are removed to make a call to Aquilon's REST interface to reset their 'personality' and 'domain' (a sketch of such a hook is shown below).
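The sketch below shows the general shape of such a hook. The Aquilon broker URL, authentication, parameter names and the 'personality'/'domain' values are illustrative only; the real RAL interface is not described in the talk.

```python
#!/usr/bin/env python3
"""Sketch of an OpenNebula hook resetting a managed VM's host in Aquilon."""
import sys

import requests

AQUILON_BROKER = "https://aquilon-broker.example.ac.uk:6900"  # placeholder URL

# Assume the hook is configured to pass the VM's hostname as its first argument.
hostname = sys.argv[1]

# Reset the host to a neutral personality and domain so the hostname can be
# reused by the next managed VM that is launched (parameter names assumed).
resp = requests.put(
    f"{AQUILON_BROKER}/host/{hostname}",
    params={"personality": "inventory", "domain": "prod"},
    verify="/etc/pki/tls/certs/ca-bundle.crt",
)
resp.raise_for_status()
```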

Expanding the farm into the cloud
This work has already been presented by Andrew Lahiff and Ian Collier at ISGC; the content of the following slides has largely been provided by them.
Much of it has also been presented at previous HEPiX meetings, so the following is a brief refresher and update.

Bursting the batch system into the cloud
Initial situation: partitioned resources: worker nodes (batch system) & hypervisors (cloud).
Ideal situation: completely dynamic
– If the batch system is busy but the cloud is not busy, expand the batch system into the cloud
– If the cloud is busy but the batch system is not busy, expand the size of the cloud and reduce the amount of batch system resources
[Diagram: cloud and batch resource pools shifting between the two situations]

This led to an aspiration to integrate the cloud with the batch system
– First step: allow the batch system to expand into the cloud
Avoid running additional third-party and/or complex services
Leverage existing functionality in HTCondor as much as possible
Proof-of-concept testing carried out with StratusLab in 2013
– Successfully ran ~11,000 jobs from the LHC VOs
This will ensure our private cloud is always used
– LHC VOs can be depended upon to provide work

Bursting the batch system into the cloud
Based on existing power management features of HTCondor.
Virtual machine instantiation:
– ClassAds for offline machines are sent to the collector when there are free resources in the cloud (see the sketch after this list)
– The negotiator can match idle jobs to the offline machines
– The HTCondor rooster daemon notices this match & triggers creation of VMs
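For illustration only, the sketch below publishes a minimal "offline" startd ClassAd with condor_advertise so that idle jobs can match it; the attribute list is deliberately small and the ads actually published at RAL will differ, as will how VM creation is then triggered.

```python
#!/usr/bin/env python3
"""Illustrative sketch: advertise an offline startd ClassAd to the collector."""
import subprocess
import tempfile

# A minimal offline machine ad; hostname and resource sizes are placeholders.
OFFLINE_AD = """
MyType = "Machine"
TargetType = "Job"
Name = "slot1@vwn-template.example.ac.uk"
Machine = "vwn-template.example.ac.uk"
State = "Unclaimed"
Activity = "Idle"
Offline = True
Cpus = 8
Memory = 16384
OpSys = "LINUX"
Arch = "X86_64"
"""

with tempfile.NamedTemporaryFile("w", suffix=".ad", delete=False) as f:
    f.write(OFFLINE_AD)
    ad_file = f.name

# Publish the ad; once idle jobs match it, condor_rooster can "wake" the
# offline machine, which in this setup means instantiating a cloud VM.
subprocess.run(["condor_advertise", "UPDATE_STARTD_AD", ad_file], check=True)
```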

Bursting the batch system into the cloud
Virtual machine lifetime:
– Managed by HTCondor on the VM itself. Configured to:
   Only start jobs when a health-check script is successful (a sketch of such a script follows)
   Only start new jobs for a specified time period
   Shut down the machine after being idle for a specified period
– Virtual worker nodes are drained when free resources on the cloud start to fall below a specified threshold
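A minimal sketch of the kind of health-check script HTCondor on the virtual worker node could run before accepting jobs; the exact checks RAL uses are not given in the talk, so CVMFS and scratch space are shown purely as plausible examples, and the NODE_IS_HEALTHY attribute name is assumed.

```python
#!/usr/bin/env python3
"""Hypothetical worker-node health check, e.g. run from a startd cron hook."""
import os
import shutil
import sys


def cvmfs_ok() -> bool:
    # Assume the worker node needs an experiment CVMFS repository mounted.
    return os.path.isdir("/cvmfs/atlas.cern.ch/repo")


def scratch_ok(min_free_gb: float = 10.0) -> bool:
    # Require a minimum amount of free scratch space before accepting jobs.
    free_gb = shutil.disk_usage("/tmp").free / 1e9
    return free_gb >= min_free_gb


if cvmfs_ok() and scratch_ok():
    # Printed attribute is picked up by the startd (attribute name assumed);
    # the START expression would then only allow jobs when it is True.
    print("NODE_IS_HEALTHY = True")
    sys.exit(0)

print("NODE_IS_HEALTHY = False")
sys.exit(1)
```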

[Diagram: central manager running condor_collector, condor_negotiator and condor_rooster; ARC/CREAM CEs running condor_schedd; physical worker nodes and virtual worker nodes running condor_startd; offline machine ClassAds sent to the collector; virtual worker nodes draining]

Bursting the batch system into the cloud
Previously this was a short-term experiment with StratusLab.
The ability to expand the batch farm into our cloud is now being integrated into our production batch system.
The challenge is having a variable resource so closely bound to our batch service.
HTCondor makes it much easier, with elegant support for dynamic resources.
But it required significant changes to monitoring:
– Moved to the condor health check; no Nagios on virtual WNs
– This has in turn fed back into the monitoring of bare-metal WNs

What Next
Consolidation of configuration; review of architecture and design decisions
Development of new use cases for STFC Facilities (e.g. ISIS and CLF)
Work as part of the INDIGO DataCloud H2020 project
Work to host more Tier 1/WLCG services
Continue work with members of the LOFAR project
Engagement with non-HEP communities
Start to engage with the EGI Federated Cloud

Any Questions?