Virtualisation at the RAL Tier 1 Ian Collier STFC RAL Tier 1 HEPiX, Annecy, 23rd May 2014.

RAL Tier 1
– Context at RAL Tier 1
– Hyper-V Services Platform
– Scientific Computing Department Cloud
– Dynamically provisioned worker nodes

Context at RAL Tier 1
Historically requests for systems went to the fabric team
– Procure new HW – could take months
– Scavenge old WNs – could take days/weeks – and they are often unreliable
Kickstarts & scripts needed customising for each system
Not very dynamic
For development systems many users simply run VMs on their desktops
– Hard to track
– Often not well managed – risky

Evolution at RAL Tier 1
Many elements play their part
– Configuration management system
Quattor (introduced in 2009) abstracts hardware from OS from payload, automates most deployment
Makes migration & upgrades much easier (still not completely trivial)
– Databases store hardware info – scripts feeding the configuration management system
Provisioning new hardware much faster
With Aquilon it is easier still

Virtualisation & RAL
– Context at RAL
– Hyper-V Services Platform
– Scientific Computing Department Cloud
– Dynamically provisioned worker nodes

Hyper-V Platform
Over the last three years
– Initially local storage only in production
– iSCSI (EqualLogic) shared storage in production 1 year ago
– ~250 VMs
Provisioning transformed
– Much more responsive to changing requirements
– Self-service basis – requires training all admins in using management tools – but this proved not so hard

Virtualisation
Three production clusters with shared storage, several local storage hypervisors
– Windows Server Hyper-V (2012 being tested now)
– Clusters distributed between two buildings
However, saw issues with VMs
– Migration problems triggered when we needed to shut down a cluster for power intervention
Re-building the shared-storage clusters from scratch
– New configuration of networking and hardware
– Windows Server 2012 and Hyper-V
– Currently migrating most VMs back to local storage systems

Hyper-V Platform
Most services virtualised now
– Exceptions: top bdii, ganglia, Oracle
Internal databases & monitoring systems
Also test beds (batch system, CEs, bdiis etc)
Move to production mostly smooth
– Team had a good period to become familiar with the environment & tools
Shared storage placement & migration managed by the production team.

Hyper-V Platform
When a Tier 1 admin needs to set up a new machine, all they have to request is a DNS entry
– Everything else they do themselves
– Placement of production systems by the production team
Maintenance of the underlying hardware platform can usually be done with (almost) no service interruption.
This is already much, much better – especially more responsive – than what went before.
Behaved well in power events

Hyper-V Platform
However, Windows administration is not friction- or effort-free (we are mostly Linux admins…)
– Troubleshooting means even more learning
– Share management server with STFC corporate IT – but they do not have the resources to support our use
– In fact we are the largest Hyper-V users on site
Hyper-V presented problems supporting Linux
– None were show-stoppers, but they drained effort and limited use
– Ease of management otherwise compensates for now
– Much better with latest SL (5.9 & 6.4)
Since we began, open source tools have moved on
– We are not wedded to Hyper-V
– Realistically will run it for a while

Virtualisation & RAL
– Context at RAL
– Hyper-V Services Platform
– Scientific Computing Department Cloud
– Dynamically provisioned worker nodes

SCD Cloud
Prototype E-Science/Scientific Computing Department cloud platform*
Began as a small experiment 2 years ago
Using StratusLab
– Share Quattor configuration templates with other sites
– Very quick and easy to get working
– But has been a moving target as it develops
Deployment implemented by graduates on 6 month rotation
– Disruptive & variable progress
Worked well enough
*Department changed name

SCD Cloud
Resources
– Initially 20 (very) old worker nodes, ~80 cores
Filled up very quickly
2 years ago added 120 cores in new Dell R410s – and also a few more old WNs
Also added half a generation of retired WNs, ~800 cores
Retired disk servers for shared storage
– This has been enough to test a number of use cases

SCD Cloud
Last summer established the principle that we were ready to move from a best-effort prototype to a service
Use cases within and beyond the Tier 1 coming into focus
– Exposing cloud APIs to LHC VOs
– Platform for SCD projects (eg H2020 projects)
– Compute services for other STFC departments (ISIS, Diamond)
– Self-service IaaS across STFC
Agreed budget for 1 FTE to April 2015 (end of GridPP 4)
In February we found someone great
– At the start of this month they took a different job
In March we managed to ‘catch’ £300K of underspend
– ~1PB of disk (for Ceph backend)
– ~1000 cores of compute, ~3.5TB RAM

SCD Cloud Future
Develop into a fully supported service for users across STFC
IaaS upon which we can offer PaaS
– One platform could ultimately be the Tier 1 itself
– Integrating cloud resources into Tier 1 grid work
Participation in cloud federations
Carrying out a fresh technology evaluation
– Things have moved on since we started with StratusLab
– Currently favour OpenNebula

Virtualisation & RAL
– Context at RAL
– Hyper-V Services Platform
– Scientific Computing Department Cloud
– Dynamically provisioned worker nodes

Bursting the batch system into the cloud
Aims
– Integrate the cloud with the batch system, eventually without partitioned resources
– First step: allow the batch system to expand into the cloud
Avoid running additional third-party and/or complex services
Leverage existing functionality in HTCondor as much as possible
Should be as simple as possible
Proof-of-concept testing carried out with the StratusLab cloud
– Successfully ran ~11000 jobs from the LHC VOs

Power management in HTCondor
Existing functionality for powering down idle machines & waking them when required
– Entering a low power state
The HIBERNATE expression can define when a slot is ready to enter a low power state
When true for all slots, the machine will go into the specified low power state
– Machines in a low power state are “offline”
The collector can keep offline ClassAds
– Returning from a low power state
The condor_rooster daemon is responsible for waking up hibernating machines
By default it will send UDP Wake-On-LAN
– Important feature: this behaviour can be replaced by a user-defined script
– When there are idle jobs
The negotiator can match jobs to an offline ClassAd
The condor_rooster daemon notices this match & wakes up the machine
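
As an illustration of the mechanism just described, a minimal configuration sketch (the idle threshold, chosen power state and paths are assumptions, not RAL settings):

    # Worker node (startd): define when a slot may enter a low power state
    ShouldHibernate = (State == "Unclaimed") && (Activity == "Idle") && \
                      ((time() - EnteredCurrentActivity) > 3600)
    HIBERNATE = ifThenElse($(ShouldHibernate), "S3", "NONE")

    # Collector: keep the ClassAds of machines that have gone offline
    COLLECTOR_PERSISTENT_AD_LOG = $(SPOOL)/OfflineAds

    # Central manager: run condor_rooster, which wakes matched offline machines
    DAEMON_LIST = $(DAEMON_LIST) ROOSTER
    ROOSTER_INTERVAL = 300
    # The default wake-up action is UDP Wake-On-LAN via condor_power;
    # ROOSTER_WAKEUP_CMD can be pointed at a user-defined script instead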

Provisioning worker nodes
Our method: extend condor_rooster to provision VMs
What we did
– Send an appropriate offline ClassAd to the collector
The hostname used is a random string
Represents a class of VMs, rather than specific machines
– condor_rooster
Configured to run an appropriate command to instantiate a VM
– When there are idle jobs
The negotiator can match jobs to the offline ClassAd
condor_rooster notices this match & instantiates a VM
– VM lifetime managed by HTCondor on the VM itself
START expression modified so that jobs can only start for a limited time
HIBERNATE expression set to shut down the VM after it has been idle for too long
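
A sketch of how this can be wired together in HTCondor configuration; the script path, VM lifetime and idle timeout are hypothetical, chosen only to illustrate the pattern:

    # Central manager: replace the Wake-On-LAN action with a script that asks
    # the cloud to instantiate a worker-node VM (condor_rooster supplies the
    # matched offline ClassAd to the command on its standard input)
    ROOSTER_WAKEUP_CMD = /usr/local/bin/instantiate_wn_vm
    # Throttle how many VMs may be requested per rooster cycle
    ROOSTER_MAX_UNHIBERNATE = 10

    # Inside the VM image (startd): accept new jobs only while the VM is
    # younger than its allowed lifetime...
    START = ( (time() - DaemonStartTime) < (24 * 3600) )
    # ...and power the VM off once it has been unclaimed and idle for too long
    ShouldShutdown = (State == "Unclaimed") && (Activity == "Idle") && \
                     ((time() - EnteredCurrentActivity) > 1800)
    HIBERNATE = ifThenElse($(ShouldShutdown), "SHUTDOWN", "NONE")

The instantiation script would call the cloud API (StratusLab in the proof of concept) to boot a worker-node image, which then joins the pool like any other dynamic resource.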

Observations
Our VMs were almost exact clones of normal WNs, including all monitoring
Condor may deal well with dynamic resources; Nagios does not
When we thought about it we realised almost all the WN monitoring was unnecessary on virtualised WNs.
Really just need the health check hooked into the startd
With minimal tuning the efficiency loss was 4-9%
All in all a very successful first step.
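
One way to hook a health check into the startd is a "startd cron" job that periodically publishes an attribute which the START expression then requires; the script path and attribute name below are hypothetical and only illustrate the idea:

    # Run a periodic health-check script from the startd
    STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) HEALTH
    STARTD_CRON_HEALTH_EXECUTABLE = /usr/local/bin/wn_healthcheck
    STARTD_CRON_HEALTH_PERIOD = 300

    # The script prints ClassAd attributes such as "NODE_IS_HEALTHY = True";
    # only start jobs while the node reports itself healthy (combine with any
    # existing START expression as appropriate)
    START = (NODE_IS_HEALTHY =?= True)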

Virtualisation & RAL
– Context at RAL
– Hyper-V Services Platform
– Scientific Computing Department Cloud
– Dynamically provisioned worker nodes

Summary
Using a range of technologies, our provisioning & workflows have become more responsive in many ways.
The private cloud has developed from a small experiment to the beginning of a real service
– With constrained effort
– Slower than we would have liked
– The prototype platform has been well used
Have demonstrated we can transparently and simply expand the batch farm into our cloud.
Ready to start making much larger changes.