Cloud & Virtualisation Update at the RAL Tier 1. Ian Collier, Andrew Lahiff, STFC RAL Tier 1. HEPiX, Lincoln, Nebraska, 17th October 2014.

RAL Tier 1
– Recap Hyper-V
– Across STFC
– Scientific Computing Department Cloud
– Batch farm related virtualisation

Recap - Hyper-V at RAL Tier 1
– Began investigating a virtualisation platform; settled on Hyper-V (a challenge for Linux admins)
– Production hypervisors using local storage in 2011
– Shared EqualLogic iSCSI storage added a year or so later (as Martin mentioned, it has recently been problematic)
– Even with some problems, life, and service provisioning, is much easier: a matter of minutes to hours to provision a new host
– See previous talks for more detail

Hyper-V Issues
– A challenge for admins without professional Windows experience to manage, especially when difficult issues arise; support from overstretched Windows experts has been limited
– Recent shared storage issues saw hypervisors losing contact with shared volumes
– Our attempts to recover made it worse, not better (see above re Windows expertise)
– The underlying cause may have been subtle firmware clashes
– Rebuilding very carefully
– But we have just hired someone with specific expertise in this area

Virtualisation & RAL
– Context at RAL
– Hyper-V Across STFC
– Scientific Computing Department Cloud
– Batch farm related virtualisation

Context at RAL Tier 1
– STFC has run a review of corporate IT services, with a working group on virtualisation
– Several separate Hyper-V deployments across the organisation; the Tier 1 has by far the largest
– Also some VMware
– Little shared effort or experience so far
– An intention emerging for common configurations and experience sharing

Virtualisation & RAL
– Context at RAL
– Hyper-V Across STFC
– Scientific Computing Department Cloud
– Dynamically provisioned worker nodes

Recap - SCD Cloud
– Began as a small experiment 2½ years ago
– Initially using StratusLab & old worker nodes; very quick and easy to get working
– Deployment until now implemented by graduates on a 6-month rotation: disruptive & variable progress
– Worked well enough to prove its usefulness
– Something of an exercise in managing expectations

SCD Cloud Future
– Develop into a fully supported service for users across STFC
– IaaS upon which we can offer PaaS; one platform could ultimately be the Tier 1 itself
– Integrating cloud resources into Tier 1 grid work
– Participation in cloud federations
– Carrying out a fresh technology evaluation: things have moved on since we started with StratusLab; currently favour OpenNebula

SCD Cloud
– In March this year we secured funding for 30 well-specified hypervisors & 30 storage servers
– ~1000 cores (plus ~800 in old worker nodes)
– ~1 PB raw storage
– Using OpenNebula, collaborating with UGent on configuration tools
– Ceph for the image store and for running images
– AD as the initial primary authentication
– 2 staff began work last month (one full time, one half time)
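
To make the OpenNebula-plus-Ceph combination above concrete, here is a minimal sketch of an OpenNebula datastore definition backed by Ceph, of the kind used for an image store. All names, hosts, pools and the secret UUID are placeholders, not RAL's actual configuration.

```
# ceph-images.ds -- illustrative OpenNebula datastore template (placeholder values)
NAME        = "scd-cloud-ceph-images"
DS_MAD      = ceph                  # datastore driver
TM_MAD      = ceph                  # transfer manager driver
DISK_TYPE   = RBD                   # images stored as Ceph RBD volumes
POOL_NAME   = one                   # Ceph pool holding the images
CEPH_HOST   = "mon1 mon2 mon3"      # Ceph monitor hosts (placeholders)
CEPH_USER   = libvirt               # CephX user for libvirt/qemu
CEPH_SECRET = "00000000-0000-0000-0000-000000000000"   # libvirt secret UUID (placeholder)
BRIDGE_LIST = "hypervisor1 hypervisor2"                 # hosts used for image staging
```

It would be registered with something like `onedatastore create ceph-images.ds`; a second datastore with TYPE = SYSTEM_DS and TM_MAD = ceph would cover the running images.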

SCD Cloud
– Initially a private cloud, closely coupled to the Tier 1 batch farm (see later)
– Provide support for Scientific Computing Department projects, e.g. Horizon 2020 projects coming online mid next year
– Also potential use cases with other STFC departments, parallel to the data services already provided to Diamond & ISIS

SCD Cloud
– Ceph backend: we have been investigating it for a couple of years
– Have a separate test bed (half a generation of standard disk servers, 1.7 PB raw) being tested for grid data storage
– Will use Ceph for the image store, running images, and possibly a data service
– A summer student worked on benchmarking & tuning
– Investigated moving journals to SSDs: although OSD benchmarks were much faster, overall performance was no different; it turned out to be the caches in the RAID controllers
– Will probably run on those systems without the separate journal
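
For reference, the kind of benchmarking described above can be reproduced with standard Ceph tooling; the commands below are an illustrative recipe (pool name and OSD number are placeholders), not the exact tests that were run.

```
# Per-OSD write benchmark (exercises the journal plus backing disk of one OSD)
ceph tell osd.0 bench

# Cluster-level object benchmarks against a scratch pool
rados bench -p scratch 60 write --no-cleanup
rados bench -p scratch 60 seq

# Pre-BlueStore, the journal location is set per OSD in ceph.conf, e.g.
#   [osd]
#   osd journal = /var/lib/ceph/osd/$cluster-$id/journal   # point at an SSD partition to test SSD journals
```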

Virtualisation & RAL
– Context at RAL
– Hyper-V Across STFC
– Scientific Computing Department Cloud
– Batch farm related virtualisation

Bursting the batch system into the cloud
– In Annecy we described leveraging HTCondor features designed for power saving to dynamically burst batch work into our private cloud (a configuration sketch follows below)
– Aims: integrate the cloud with the batch system; the first step is to allow the batch system to expand into the cloud
– Avoid running additional third-party and/or complex services; leverage existing functionality in HTCondor as much as possible
– Proof-of-concept testing carried out with the StratusLab cloud: successfully ran ~11000 jobs from the LHC VOs
– This ensures good utilisation of our private cloud, as the LHC VOs can always provide work
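
As a rough illustration of the mechanism referred to above, the sketch below shows how HTCondor's power-saving machinery (condor_rooster acting on offline machine ClassAds) can be pointed at a cloud instead of at wake-on-LAN. The ROOSTER_* knobs are standard HTCondor configuration; the custom attribute and the provisioning script are hypothetical placeholders, not RAL's actual setup.

```
# Sketch: central-manager configuration (assumptions marked inline)

# Run the rooster daemon alongside the usual central-manager daemons
DAEMON_LIST = $(DAEMON_LIST) ROOSTER

# How often (seconds) rooster considers "waking" offline machines
ROOSTER_INTERVAL = 300

# Only act on offline slot ads that represent virtual worker nodes
# (VirtualWorkerNode is a hypothetical custom attribute in the offline ad)
ROOSTER_UNHIBERNATE = Offline =?= True && VirtualWorkerNode =?= True

# Instead of sending a wake-on-LAN packet, ask the private cloud for a VM
# (provision_vm.sh is a hypothetical site script)
ROOSTER_WAKEUP_CMD = "/usr/local/bin/provision_vm.sh"

# Offline ads for potential virtual worker nodes are injected separately, e.g.
#   condor_advertise UPDATE_STARTD_AD offline-wn.ad
```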

The Vacuum Model
– The "vacuum" model is becoming popular in the UK, as an alternative to a CE + batch system or to clouds
– No centrally-submitted pilot jobs or requests for VMs: VMs appear by "spontaneous production in the vacuum"
– The VMs run the appropriate pilot framework to pull down jobs
– See Andrew McNab's talk in Annecy
– Can we incorporate the vacuum model into our existing batch system? HTCondor has a "VM universe" for managing VMs

The Vacuum Model & HTCondor
– Considered using fetch work hooks on the worker nodes, but unfortunately fairshares would not be respected
– Now using a scheduler universe job which runs permanently & submits jobs (VMs) as necessary, one for each VO
– If there is no work, or an error, only submit 1-2 VMs every hour
– If the VMs are running real work, submit more VMs
– Fairshares are maintained, since the negotiator decides which VMs to run
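
A minimal sketch of that per-VO scheduler-universe "factory" logic is given below, using the HTCondor Python bindings. It assumes recent bindings (Schedd.submit accepting a Submit object) and invents the job attributes VacuumVO and PilotRunningWork purely for illustration; the actual implementation may differ.

```python
#!/usr/bin/env python
# Illustrative per-VO VM factory: runs forever as a scheduler-universe job,
# keeps a trickle of VMs when idle and scales up when VMs report real work.
import time
import htcondor

VO = "atlas"              # hypothetical VO label
TRICKLE = 2               # VMs to submit per cycle when idle or failing
CYCLE = 3600              # seconds between submission cycles

schedd = htcondor.Schedd()

# Abridged VM-universe job; vm_disk, contextualisation hooks etc. omitted
vm_job = htcondor.Submit({
    "universe": "vm",
    "vm_type": "kvm",
    "vm_memory": "2048",
    "executable": "cernvm_%s" % VO,          # logical name only
    "+VacuumVO": '"%s"' % VO,                # invented attribute for bookkeeping
})

while True:
    # VMs of this VO that are running and have flagged (via condor_chirp)
    # that their pilot picked up a real payload
    busy = schedd.query(
        'VacuumVO == "%s" && JobStatus == 2 && PilotRunningWork =?= True' % VO,
        ["ClusterId"],
    )
    # Fairshares are still enforced: we only *submit* VMs here,
    # the negotiator decides which of them actually run.
    schedd.submit(vm_job, count=max(TRICKLE, len(busy)))
    time.sleep(CYCLE)
```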

The Vacuum Model & HTCondor
– Running VMs using HTCondor, with the CernVM 3 image: the ISO image is only 15 MB in size, and CVMFS provides the root partition
– Using a job hook for:
  – Creating a 20 GB sparse hard disk image file for the CVMFS cache
  – Creating a 40 GB sparse hard disk image file to provide additional disk space, mounted as /scratch on the VM
  – Contextualization
  – Setting up NFS so that the pilot log files can be written to the worker node's disk
– Using condor_chirp for adding information about the status of the pilot to the job's ClassAd
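
Below is an illustrative HTCondor VM-universe submit description along the lines just described; a sketch under assumptions, with file names, sizes and the chirp attribute as placeholders rather than the production configuration.

```
# cernvm-pilot.sub -- illustrative VM-universe submit file
universe       = vm
executable     = cernvm_pilot            # logical name only in the VM universe
vm_type        = kvm
vm_memory      = 2048                    # MB
vm_networking  = true
# CernVM 3 boot ISO plus the sparse disks created by the prepare-job hook:
vm_disk        = cernvm3.iso:hda:r,cvmfs_cache.img:hdb:w,scratch.img:hdc:w
log            = vm_$(Cluster).log
queue

# Inside the worker-node hooks, pilot state can be pushed back into the
# job ClassAd with condor_chirp, e.g.:
#   condor_chirp set_job_attr PilotRunningWork True
```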

The Vacuum Model & HTCondor
– Testing so far includes:
– Successfully running regular SAM tests from the GridPP DIRAC instance
– ATLAS validation jobs: a first attempt at running jobs from ATLAS; need to investigate the cause of the failures (payload ran out of memory; error copying the file from the job workdir to the local SE)

Integrating Batch & Cloud
All these developments give us more flexibility in managing & delivering Tier 1 computing services

Virtualisation & RAL
– Context at RAL
– Hyper-V Services Platform
– Scientific Computing Department Cloud
– Batch farm related virtualisation

Summary
– Provisioning is already much more responsive
– The private cloud has developed from a small experiment into the beginning of a real service
– With constrained effort, slower than we would have liked
– The prototype platform has been well used
– Demonstrated transparent expansion of the batch farm into the cloud
– Proof-of-concept vacuum model implementation using HTCondor
– The whole Tier 1 service is becoming more flexible

Questions?