CMS Experience Provisioning Cloud Resources with GlideinWMS
Anthony Tiradani, HTCondor Week 2015, 20 May 2015

glideinWMS Quick Facts
glideinWMS is an open-source Fermilab Computing Sector product driven by CMS
Heavy reliance on HTCondor from UW Madison, and we work closely with them

GlideinWMS Overview
[Architecture diagram: the VO infrastructure (VO Frontend, HTCondor Scheduler holding the jobs, HTCondor Central Manager, Glidein Factory / WMS Pool) provisions a glidein on a Grid Site worker node and a glideinWMS VM on a Cloud Site; in both cases a bootstrap script starts an HTCondor Startd that joins the VO pool.]
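Once a glidein or cloud VM has bootstrapped, it simply appears as another Startd in the VO's HTCondor pool. A minimal sketch, using the HTCondor Python bindings, of how an operator might verify this; the central manager hostname is a hypothetical placeholder, and GLIDEIN_Site may not be present on every ad.

    import htcondor

    # Query the VO pool's collector for all execute nodes (grid glideins
    # and cloud VMs look identical once their Startd has registered).
    collector = htcondor.Collector("vofrontend-cm.example.org")  # placeholder host
    startds = collector.query(
        htcondor.AdTypes.Startd,
        projection=["Name", "State", "Activity", "GLIDEIN_Site"],
    )
    for ad in startds:
        print(ad.get("Name"), ad.get("GLIDEIN_Site", "unknown"), ad.get("State"))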

CMS Use of CERN AI
CERN AI: CERN Agile Infrastructure
T0 (Tier-0): the CERN Data Center, traditionally where first-pass data processing takes place (see Dave Mason's talk for details)
T0 completely moved from LSF to AI (OpenStack)
Approximately 9000 cores now, growing to ~15000 (roughly 220 kHS06)
8-core / 16 GB VMs
Images built by CMS using the CERN-IT automated build system
VMs are not very dynamic but almost static (~1 month lifetime)
Resources are part of the T0 pool and shared with other users; however, T0 runs with very high priority and so will push out other users quickly if needed

GlideinWMS & CERN AI
[Same provisioning architecture as in the GlideinWMS overview, with CERN AI as the Cloud Site.]
Uses the EC2 API (illustrated below)
AI had stability issues, mostly fixed now
VMs have ~1 month lifetime
Still suffer from long ramp-up times
Due to resource scarcity, there is a danger of losing resources if they are released
Results have been mostly favorable
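The factory talks to CERN AI through OpenStack's EC2-compatible interface. A hedged sketch of the kind of call involved, using boto3; the endpoint URL, region, credentials, image ID and flavor are hypothetical placeholders, and in production glideinWMS issues this request through HTCondor's EC2 grid universe rather than calling boto directly.

    import boto3

    # EC2-compatible endpoint exposed by the OpenStack cloud (placeholder values).
    ec2 = boto3.client(
        "ec2",
        endpoint_url="https://openstack.example.cern.ch:8788/services/Cloud",
        region_name="nova",                    # placeholder region name
        aws_access_key_id="ACCESS_KEY",        # placeholder credentials
        aws_secret_access_key="SECRET_KEY",
    )

    # The user-data payload is the glidein bootstrap: it contextualizes the VM,
    # starts an HTCondor Startd, and points it at the VO pool.
    with open("glidein_bootstrap.sh", "rb") as f:
        user_data = f.read().decode()

    ec2.run_instances(
        ImageId="ami-00000001",    # CMS-built image registered in the cloud
        InstanceType="m1.large",   # hypothetical flavor name
        MinCount=1,
        MaxCount=1,
        UserData=user_data,
    )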

CMS HLT as a Cloud
HLT: High Level Trigger farm
The HLT is a considerable CPU resource. At startup:
– Dell C6100: 288 nodes, 12 cores / 24 GB RAM each (will be decommissioned end of 2015 and replaced)
– Dell C6220: 256 nodes, 16 cores / 32 GB RAM each (will be decommissioned end of 2016 and replaced)
– Megware: 360 nodes, 24 cores / 64 GB RAM each (the brand-new machines)
No usable mass storage (only local disk on each machine)
60 Gb/s network connection to CERN IT
When the farm is needed as the HLT it must be the HLT, and alternative use must not interfere
– However, when not being the HLT this is an extremely valuable resource
A cloud solution (based on OpenStack) was chosen, with VMs that can be killed completely if needed
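To put "considerable CPU resource" in numbers, a quick tally using only the node counts quoted above:

    # Aggregate size of the HLT farm, from the per-node figures on the slide.
    nodes = [
        ("Dell C6100", 288, 12, 24),   # (name, nodes, cores/node, GB RAM/node)
        ("Dell C6220", 256, 16, 32),
        ("Megware",    360, 24, 64),
    ]
    total_cores = sum(n * c for _, n, c, _ in nodes)
    total_ram_tb = sum(n * r for _, n, _, r in nodes) / 1024
    print(f"{total_cores} cores, ~{total_ram_tb:.0f} TB RAM")  # 16192 cores, ~37 TB RAM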

CMS HLT as a Cloud
Even with expected overheads we can typically expect ~6 hours of usable time between runs, but this time is randomly distributed.
So we anticipate running ~2-hour jobs, though this will be adjusted as we gain experience.
Even during fills, use of the HLT is not constant, so it should sometimes be possible to use part of the HLT even during data taking.

CMS HLT as a Cloud
So we have a 3-level plan:
1. Start by using the HLT as a cloud during technical stops. If that works…
2. Start to use the HLT as a cloud between fills. If that works…
3. Start to use parts of the HLT as a cloud during fills at low luminosity.
However, in this model the HLT team must be in control. So a new tool, Cloud-igniter, is used by the online team to start and kill VMs, rather than requests coming from the factory. New VMs connect to the Global Pool.

GlideinWMS & CMS HLT
[Same architecture diagram as before, with the HLT as the Cloud Site: HLT VMs run an HTCondor Startd that joins the glideinWMS pool directly.]
GlideinWMS does NOT provision the VMs
HLT administrators provision and tear down VMs according to HLT availability
VMs contain a vanilla HTCondor Startd with the appropriate configuration to join the glideinWMS pool statically (see the sketch after this slide)
Squid servers provided by HLT admins
Workflow planning required
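Because these VMs are not glideins, pool membership comes from a static Startd configuration baked into the image. A minimal sketch of what such a configuration might contain, written here as a Python snippet that generates the config file; the collector hostname, file path and advertised site name are illustrative assumptions (authentication settings omitted), not the actual CMS Global Pool configuration.

    # Sketch: a static Startd configuration so an HLT VM joins the glideinWMS
    # pool on boot (hostname and path are hypothetical placeholders).
    CONFIG = """\
    # Central manager of the pool this VM should join (placeholder hostname)
    CONDOR_HOST = cms-global-pool-cm.example.cern.ch
    # Worker node only: run just the master and the startd
    DAEMON_LIST = MASTER, STARTD
    START = TRUE
    # Advertise the site so workflows can be steered appropriately
    GLIDEIN_Site = "CERN_HLT"
    STARTD_ATTRS = $(STARTD_ATTRS) GLIDEIN_Site
    """

    with open("/etc/condor/config.d/99-hlt-cloud.conf", "w") as f:
        f.write(CONFIG)
    # The VM's init scripts then start condor_master, which launches the Startd
    # and registers it with the pool's collector.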

CMS HLT as a Cloud
Can start 1200 cores in ~10 minutes with the EC2 API
The EC2 API is faster than the Nova API

CMS HLT as a Cloud
Can shut down 700 cores in ~20 minutes with the EC2 API
– In reality, HLT admins hard-kill the VMs, which takes ~1 minute for the entire cluster

Institutional Clouds
Have been working with clouds in China, Italy and the UK, all OpenStack
Have run user analysis using GlideinWMS installed in Italy and the UK
Have mainly tested institutional clouds using the "traditional" CMS cloud approach
However, have also carried out tests using other tools

Fermilab OneFacility Project
The goal is to transparently incorporate all resources available to Fermilab into one coherent facility
Experiments should be able to perform the full spectrum of computing work on resources regardless of the "location" of the resources
Include commercial and community cloud resources
Include "allocation-based" resources via BOSCO

Fermilab OneFacility Project
Initial work will focus on several use cases:
– The NOvA use case tests sustained data processing in the cloud over the course of several months
– The CMS use case tests bursting into the cloud, using many resources for a short period
– Other use cases will explore specialized hardware needs

Cloud Bursting
The pilot project is to "burst" into AWS
Want to use ~56K cores of m3.2xlarge instances (8 vCPUs, 30 GiB RAM, 160 GB SSD each) for a month
The initial phase will be provisioned through the FNAL Tier-1 as part of the OneFacility project
See Sanjay Padhi's talk for more details
Would like to acknowledge Michael Ernst and ATLAS for initial work with AWS and for sharing lessons learned
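A quick sanity check on the scale implied by those numbers, using only the figures on the slide:

    # Scale of the proposed AWS burst, from the slide's figures.
    target_cores = 56_000
    vcpus_per_instance = 8           # m3.2xlarge
    ram_per_instance_gib = 30

    instances = target_cores // vcpus_per_instance
    print(f"~{instances} m3.2xlarge instances")                                  # ~7000 instances
    print(f"~{instances * ram_per_instance_gib / 1024:.0f} TiB aggregate RAM")   # ~205 TiB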

Cloud Challenges
Grid sites provide certain infrastructure; cloud sites provide the tools to build that infrastructure
Data placement
Authentication/Authorization
VM image management
Cost management

HTCondor Wishlist
Make the network a first-class citizen
– Network bandwidth management is one of the challenges facing the OneFacility project for all use cases
– Counters for how much data is being transferred in and out, knobs to restrict traffic, etc.
Native cloud protocols
– Using the EC2 API now (see the sketch after this slide)
– Native Nova support for OpenStack (the EC2 API is moving out of the core and into third-party development/support)
– Google Compute Engine support
– Microsoft Azure support?
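Today cloud resources are requested through HTCondor's EC2 grid universe. A minimal sketch of such a request via the HTCondor Python bindings; the endpoint, AMI, instance type and credential file paths are hypothetical placeholders, and a real glideinWMS factory generates an equivalent submission with additional glidein-specific parameters.

    import htcondor

    # Sketch of an EC2 grid-universe request (placeholder endpoint, AMI, paths).
    submit = htcondor.Submit({
        "universe": "grid",
        "grid_resource": "ec2 https://ec2.us-east-1.amazonaws.com/",
        "executable": "cms_cloud_glidein",              # label only for EC2 jobs
        "ec2_access_key_id": "/etc/cloud/access_key",   # files holding credentials
        "ec2_secret_access_key": "/etc/cloud/secret_key",
        "ec2_ami_id": "ami-00000001",
        "ec2_instance_type": "m3.2xlarge",
        "ec2_user_data_file": "glidein_bootstrap.sh",   # contextualization payload
    })

    schedd = htcondor.Schedd()
    schedd.submit(submit, count=1)  # one VM request per submitted job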

HTCondor Wishlist
Better integration with cloud tooling
– The EC2 API is good if a longer ramp-up is tolerable
– Not so good if you need to burst now
– We need to build grid infrastructure in the cloud; it would be nice to be able to "orchestrate" the architecture on demand (OpenStack Heat, AWS CloudFormation)
Provide a better view of the total cloud usage orchestrated by HTCondor (see the sketch after this slide)
– ClassAds that contain (as reported by the cloud infrastructure):
total computing time
total storage usage
total inbound and outbound network usage
usage statistics from other services offered by the infrastructures
LogStash filters for all HTCondor logs
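Part of the computing-time piece can already be approximated from existing job ClassAds, which is the spirit of this wishlist item. A hedged sketch, assuming the standard JobUniverse and RemoteWallClockTime attributes and a schedd reachable from the local configuration; storage and network totals would need the new cloud-reported ClassAds requested above.

    import htcondor

    # Approximate total wall-clock time used by grid-universe (e.g. EC2) jobs,
    # from attributes that already exist in job ClassAds.
    schedd = htcondor.Schedd()
    jobs = schedd.query(
        constraint="JobUniverse == 9",                  # 9 = grid universe
        projection=["ClusterId", "RemoteWallClockTime"],
    )
    total_hours = sum(float(ad.get("RemoteWallClockTime", 0)) for ad in jobs) / 3600.0
    print(f"Grid/cloud jobs in the queue have used ~{total_hours:.1f} wall-clock hours")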

Acknowledgements
David Colling (Imperial College) for use of CHEP talk
CMS support: Anastasios Andronidis, Daniela Bauer, Olivier Chaze, David Colling, Marc Dobson, Maria Girone, Claudio Grandi, Adam Huffman, Dirk Hufnagel, Farrukh Aftab Khan, Andrew Lahiff, Alison McCrae, Massimo Sgaravatto, Duncan Rand, Xiaomei Zhang, plus support from many other people
ATLAS: Michael Ernst and others for sharing knowledge and experiences
glideinWMS team
HTCondor Team

Questions?