Workload management, virtualisation, clouds & multicore
Andrew Lahiff

Job submission
LHC VOs use pilot frameworks
– ATLAS: PanDA
– CMS: glideinWMS, starting to use PanDA for some analysis
– LHCb: DIRAC
Recent/current work
– Instantiate VMs instead of submitting pilot jobs
– Multicore jobs, e.g. improvements to glideinWMS so that single-core jobs can be scheduled efficiently within multicore pilots (see the sketch below)
What about non-LHC VOs that use the WMS?
– Especially after it’s fully decommissioned by the LHC VOs (soon)
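As a rough illustration of the mechanics: glideinWMS multicore pilots advertise HTCondor partitionable slots, so a single-core job simply requests one CPU and the negotiator carves a dynamic slot out of the pilot. A minimal sketch using the htcondor Python bindings (the payload script and resource values are hypothetical):

```python
import htcondor

# A single-core job description; in a pool where multicore pilots
# advertise partitionable slots, the negotiator carves a 1-core
# dynamic slot out of an 8-core pilot instead of leaving cores idle.
job = htcondor.Submit({
    "executable": "analysis.sh",   # hypothetical payload script
    "request_cpus": "1",
    "request_memory": "2048",      # MB
    "output": "job.out",
    "error": "job.err",
    "log": "job.log",
})

schedd = htcondor.Schedd()         # local schedd
with schedd.transaction() as txn:
    cluster_id = job.queue(txn)
print("submitted cluster", cluster_id)
```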

Virtualisation
Virtualisation of services at sites
– Has been done for a long time now
– Improves resource utilisation by consolidating multiple underutilised machines onto single hypervisors
– Simplifies management by supporting live migration (see the sketch below)
– …
Generally transparent to the experiments
Batch system virtualisation
– Not quite so common, but some sites do this
– INFN-CNAF has had production virtual worker nodes for a number of years
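Live migration is what makes much of this transparent: a guest can be moved between hypervisors while it keeps running. A minimal sketch using the libvirt Python bindings (hostnames and the VM name are hypothetical):

```python
import libvirt

# Open connections to the source and destination hypervisors
# (hostnames hypothetical).
src = libvirt.open("qemu:///system")
dst = libvirt.open("qemu+ssh://hypervisor2.example.org/system")

# Live migration: the guest keeps running while its memory is
# copied across, so the move is invisible to the service it hosts.
dom = src.lookupByName("service-vm-01")
dom.migrate(dst, libvirt.VIR_MIGRATE_LIVE, None, None, 0)

dst.close()
src.close()
```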

Clouds – Security
Trusted images
– Policy proposed by the HEPiX working group
Root access
– No root access needed for end users in WLCG VOs
– Root access restricted to the user who instantiates the VMs (e.g. pilot factory)
Traceability
– Same level of traceability back to the end user as we have in the grid
– Can/should glexec still be used?

Clouds – Contextualisation
Contextualisation: a way to pass data to the image at instantiation time
– E.g. pass a proxy to the image (see the sketch below)
Different choices currently in use
– amiconfig (e.g. CernVM)
– BoxGrinder (e.g. CMS HLT cloud)
– cloud-init
Is using a common mechanism important?
– Probably not a big deal
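For illustration, a cloud-init user-data document delivering a proxy at boot might be assembled like this (a sketch only; the proxy file and target path are hypothetical):

```python
import base64

# Hypothetical proxy file created by e.g. a pilot factory.
with open("x509up_u500") as f:
    proxy_b64 = base64.b64encode(f.read().encode()).decode()

# #cloud-config document: cloud-init inside the VM decodes and
# writes the proxy at boot, before any payload starts.
user_data = f"""#cloud-config
write_files:
  - path: /tmp/x509up
    owner: root:root
    permissions: '0600'
    encoding: b64
    content: {proxy_b64}
"""
print(user_data)
```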

Clouds – VM Instantiation
Different interfaces
– Dominated by EC2
– EGI Federated Cloud using & recommending OCCI
  – Contextualisation not supported
VOs use/could use an abstraction API (see the sketch below)
– libCloud (DIRAC)
– Deltacloud (supported by Condor, could be used by CMS)
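A minimal sketch of instantiating a VM through such an abstraction layer, here Apache libcloud's EC2 driver (credentials, image and size IDs are hypothetical; ex_userdata carries the contextualisation payload):

```python
from libcloud.compute.types import Provider
from libcloud.compute.providers import get_driver

# EC2-compatible endpoint via Apache libcloud; the same create_node()
# call works against other providers by swapping the driver.
Driver = get_driver(Provider.EC2)
conn = Driver("ACCESS_KEY", "SECRET_KEY", region="eu-west-1")

image = conn.list_images(ex_image_ids=["ami-12345678"])[0]
size = [s for s in conn.list_sizes() if s.id == "m1.medium"][0]

node = conn.create_node(
    name="atlas-vm-01",
    image=image,
    size=size,
    ex_userdata="#cloud-config\n...",  # contextualisation payload
)
print(node.id, node.state)
```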

Clouds – VM Duration
Some VOs want long-lived VMs
– Sites need to be able to stop them when they are no longer being used
Graceful stopping of VMs
– Requires a mechanism to tell the VM user when the VM will terminate (see the sketch below)
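One candidate mechanism is the HEPiX machine/job features proposal, under which the site publishes a shutdown timestamp inside the VM. A sketch of a VM-side check, assuming that convention (the drain behaviour is hypothetical):

```python
import os
import time

# The machine/job features proposal publishes the planned termination
# time as a Unix timestamp in $MACHINEFEATURES/shutdowntime.
mf = os.environ.get("MACHINEFEATURES", "/etc/machinefeatures")

def seconds_until_shutdown():
    """Seconds left before the VM is terminated, or None if not set."""
    try:
        with open(os.path.join(mf, "shutdowntime")) as f:
            return int(f.read().strip()) - int(time.time())
    except (IOError, ValueError):
        return None

remaining = seconds_until_shutdown()
if remaining is not None and remaining < 3600:
    # Hypothetical drain step: stop accepting new payloads, let the
    # current one finish, then shut down cleanly.
    print("draining: VM terminates in %ds" % remaining)
```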

Clouds – VM Scheduling
“Fairshare-like” resource sharing
– Want to avoid static partitioning
– Allow VOs to make use of resources not being used by others
Issues
– Cloud platforms usually have very basic scheduling
– No queueing system like batch systems have
– Graceful stopping of VMs is important
Economic models
– Lots of discussion
– E.g. VOs are given credits; the price of a VM increases with its duration and with the number of VMs the VO already owns (see the sketch below)
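A toy version of such a pricing rule, with the linear form and rate constants invented purely to make the idea concrete:

```python
# Toy credit-based pricing: a VM costs more per hour the longer it
# has been running and the more VMs its VO already owns. The linear
# form and rate constants are invented for illustration.
def vm_price_per_hour(hours_running, vms_owned_by_vo,
                      base=1.0, age_rate=0.05, fleet_rate=0.1):
    return (base
            * (1 + age_rate * hours_running)
            * (1 + fleet_rate * vms_owned_by_vo))

print(vm_price_per_hour(0, 0))    # fresh VM, no fleet: 1.0 credit/hour
print(vm_price_per_hour(24, 10))  # day-old VM, 10-VM fleet: 4.4 credits/hour
```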

Clouds – Accounting
General agreement on using wall-clock time for accounting in clouds (see the sketch below)
APEL has demonstrated its ability to do the reporting
VM benchmarking
– How do we ensure consistency between sites?
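In miniature, wall-clock cloud accounting amounts to charging the VM's whole lifetime, scaled by a per-site benchmark. The sketch below uses field names that loosely follow an APEL-style cloud usage record but are simplified for illustration:

```python
from datetime import datetime

# Wall-clock accounting in miniature: charge the VM's whole lifetime,
# scaled by a per-core benchmark (e.g. HS06) for cross-site
# comparability. Field names are simplified, APEL-like approximations.
def usage_record(vm_id, vo, start, end, cores, hs06_per_core):
    wall = int((end - start).total_seconds())
    return {
        "VMUUID": vm_id,
        "VO": vo,
        "WallDuration": wall,
        "CpuCount": cores,
        "ScaledWallDuration": wall * cores * hs06_per_core,
    }

rec = usage_record("vm-0001", "atlas",
                   datetime(2014, 1, 1, 9, 0), datetime(2014, 1, 2, 9, 0),
                   cores=4, hs06_per_core=10.0)
print(rec["WallDuration"], rec["ScaledWallDuration"])  # 86400 3456000.0
```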

Multicore
Experiments have been working on multicore for a long time now
– Not yet used routinely for “real” production work
Issues from the site perspective
– Whole node vs multicore
– Static vs dynamic allocation
Partitioning of resources can prevent full utilisation of the farm
– Queues that give access to nodes with fixed numbers of cores
– Fixed configuration of the number of cores in the local queue and job
Dynamic resource allocation (see the sketch below)
– LRMS schedules a dynamic number of free cores
– Jobs/pilots specify requirements (cores/whole node, RAM)
– LRMS informs jobs of the allocated number of cores
– Resources shared with single-core queues
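With dynamic allocation the payload must discover how many cores it was actually granted. The sketch below checks a job features file and then common batch-system variables; treat the exact key and variable names as site-dependent assumptions:

```python
import os
import multiprocessing

def allocated_cores():
    """Best-effort discovery of the core count granted by the LRMS.

    Checks the job features file, then batch-system environment
    variables; the key and variable names vary by site and should be
    treated as assumptions. Falls back to a single core.
    """
    jf = os.environ.get("JOBFEATURES")
    if jf:
        try:
            with open(os.path.join(jf, "allocated_cpu")) as f:
                return int(f.read().strip())
        except (IOError, ValueError):
            pass
    for var in ("SLURM_CPUS_ON_NODE", "NSLOTS"):  # SLURM, Grid Engine
        if var in os.environ:
            return int(os.environ[var])
    return 1

# Size the worker pool to exactly what was granted, so a multicore
# slot is neither oversubscribed nor left partly idle.
pool = multiprocessing.Pool(processes=allocated_cores())
```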