Commissioning the CERN IT Agile Infrastructure with experiment workloads
Ramón Medrano Llamas, IT-SDC-OL
14.10.2013

Presentation transcript:

Commissioning the CERN IT Agile Infrastructure with experiment workloads
Ramón Medrano Llamas, IT-SDC-OL

Agenda
- The Agile Infrastructure
- Workload Management Systems
- Dynamic provisioning
- Conclusions

The Agile Infrastructure
- Private IaaS cloud, based on OpenStack
- Federates the Meyrin and Wigner data centres
- Target for 2015: 15,000 hypervisors and 300,000 VMs
- Configuration management with Puppet and Foreman
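To give a feel for the IaaS layer, here is a minimal sketch of booting a worker VM through the Nova API with python-novaclient. The credentials, auth URL, image and flavor names are hypothetical placeholders, not the actual Agile Infrastructure configuration:

```python
# Minimal sketch: start one worker VM on an OpenStack cloud via the
# Nova API. All credentials and names below are invented placeholders.
from novaclient import client

nova = client.Client("2",                       # Nova API version
                     "username", "password",
                     "my-project",              # tenant/project name
                     "https://openstack.example.cern.ch:5000/v2.0")

image = nova.images.find(name="SLC6-worker")    # hypothetical image name
flavor = nova.flavors.find(name="m1.large")

server = nova.servers.create(name="worker-001",
                             image=image,
                             flavor=flavor)
print(server.id, server.status)
```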

Workload Management Systems
- Both experiments use pilot-based systems
- ATLAS: PanDA
- CMS: glideinWMS
- Both have an HTCondor backend
- Provisioning through the Nova and EC2 APIs
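As a reminder of the pilot model the two systems share, a schematic sketch (assumed names and endpoint, not PanDA or glideinWMS code): the pilot starts on the worker node and then pulls real payloads from a central queue until it drains.

```python
# Schematic pilot job: runs on a worker node and pulls payloads from a
# central task queue. The endpoint and JSON shape are hypothetical.
import json
import subprocess
import urllib.request

def fetch_payload(queue_url):
    """Ask the central queue for the next job description (hypothetical API)."""
    with urllib.request.urlopen(queue_url + "/next-job") as resp:
        body = resp.read()
    return json.loads(body) if body else None

def run_pilot(queue_url):
    """Pull-and-run loop: the defining feature of a pilot-based system."""
    while True:
        payload = fetch_payload(queue_url)
        if payload is None:
            break                                # queue drained: free the slot
        subprocess.call(payload["command"])      # e.g. ["run_job.sh", "job.conf"]

if __name__ == "__main__":
    run_pilot("https://queue.example.cern.ch")   # hypothetical endpoint
```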

PanDA integration
- HTCondor cluster deployed manually on the cloud
- Long-lived worker nodes
- HTCondor used for pilot submission
- Software via CVMFS, data via EOS
- Same setup used on the HLT farms, Helix Nebula, Rackspace…
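A minimal sketch of what pilot submission through HTCondor looks like in this setup; the submit description below is illustrative (vanilla universe, hypothetical pilot wrapper and queue name), not the actual PanDA configuration:

```python
# Illustrative only: generate and submit an HTCondor job that starts
# pilots on the long-lived cloud worker nodes. Paths and the queue
# name are hypothetical.
import subprocess
import tempfile

SUBMIT = """\
universe   = vanilla
executable = /opt/pilot/run_pilot.sh
arguments  = --queue CERN_AI
output     = pilot.$(Cluster).$(Process).out
error      = pilot.$(Cluster).$(Process).err
log        = pilot.log
queue 50
"""

with tempfile.NamedTemporaryFile("w", suffix=".sub", delete=False) as f:
    f.write(SUBMIT)
    submit_file = f.name

# Sends 50 pilot jobs to the HTCondor pool running on the cloud VMs.
subprocess.check_call(["condor_submit", submit_file])
```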

glideinWMS integration
- Dynamic cluster deployment through the EC2 API
- Worker nodes managed automatically
- HTCondor used for batch orchestration
- Software via CVMFS, data via EOS
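To illustrate the EC2 path (glideinWMS itself drives this through HTCondor's EC2 support), a sketch using boto against the EC2 endpoint of an OpenStack cloud; the endpoint, credentials, AMI id and user data are assumed placeholders:

```python
# Sketch: start a contextualised worker via the EC2 API of an OpenStack
# cloud, the way a dynamic-provisioning system would. All identifiers
# are hypothetical placeholders.
from boto.ec2.connection import EC2Connection
from boto.ec2.regioninfo import RegionInfo

conn = EC2Connection(
    aws_access_key_id="EC2_ACCESS_KEY",
    aws_secret_access_key="EC2_SECRET_KEY",
    region=RegionInfo(name="nova", endpoint="openstack.example.cern.ch"),
    port=8773, path="/services/Cloud", is_secure=False)

# Contextualisation: make the VM join the HTCondor pool at boot.
user_data = "#!/bin/sh\n/usr/sbin/condor_master"

reservation = conn.run_instances(
    "ami-00000001",                  # image registered on the cloud
    instance_type="m1.large",
    user_data=user_data)
print(reservation.instances[0].id)
```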

Deployment
[Slide 7 shows the deployment diagram; no text content in the transcript.]

Support from the AI Team
- 1,600 cores provided by the OpenStack team
- Tested the Essex, Folsom and Grizzly releases
- Complete freedom to access the resources
- Consultancy available at any time
- Rapid bug report-to-fix cycle

Testing strategy
- Standard HammerCloud benchmarks
- Results compared with other clouds and with bare metal
- 690,000 ATLAS jobs
- 337,000 CMS jobs

Testing summary

                        ATLAS                 CMS
Scheduler               PanDA                 glideinWMS
Cluster management      Static                Dynamic
Cluster size            770 cores             200 cores
Jobs submitted          694,698               337,080
Failure rate            9.95%                 0.31%
Job type                Simulation            Simulation
Typical job duration    31 min.               9 min.
Duration variance       17.8 min.             4.8 min.
Most common error       Failed to read LFC    App. Error 8020
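For reference, the aggregation behind numbers like these is straightforward; a sketch with invented job records, not HammerCloud code (note the slide quotes "duration variance" in minutes, so it is presumably a standard deviation):

```python
# Sketch: derive failure rate and duration statistics from job records.
# The records here are invented examples, not real HammerCloud data.
from statistics import mean, stdev

jobs = [
    {"duration_min": 30.5, "ok": True},
    {"duration_min": 33.1, "ok": True},
    {"duration_min": 12.0, "ok": False},   # a failed job
]

failure_rate = 100.0 * sum(not j["ok"] for j in jobs) / len(jobs)
durations = [j["duration_min"] for j in jobs if j["ok"]]

print(f"Failure rate: {failure_rate:.2f}%")
print(f"Typical duration: {mean(durations):.1f} min.")
print(f"Duration spread: {stdev(durations):.1f} min.")
```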

Performance testing: ATLAS, Ibex
[Table: Wallclock (s), CPU efficiency (%) and failure rate (%) per site for OPENSTACK_CLOUD, BNL_CLOUD, IAAS, CERN-PROD and BNL_CVMFS_1; the numeric values did not survive the transcript.]
- Tested late 2012
- Over-commitment of resources
- CPU efficiency is not a reliable metric in this context

Performance testing: ATLAS, Grizzly
[Table: same sites and columns as the previous slide; numeric values not recoverable from the transcript.]
- Tested late 2013
- Clear improvement in performance, and in predictability

Performance testing: CMS, Ibex
[Table: Wallclock (s), CPU efficiency (%) and failure rate (%) for T2_CH_CERN_AI, T2_CH_CERN, T1_US_FNAL and T1_DE_KIT; numeric values not recoverable from the transcript.]
- Tested mid 2013
- Reliability is remarkably good

Performance testing: ATLAS wallclock
[Slide 14 shows a wallclock plot; no further text in the transcript.]

Reliability testing
- Few failures overall
- Hard to distinguish infrastructure failures from WMS failures
- New monitoring techniques are needed
- Difficult to measure with state-of-the-art tools

Reliability testing
[Slides 16-18 show reliability monitoring plots; no further text in the transcript.]

Dynamic provisioning
- glideinWMS scales clusters automatically
- Slow to react to non-batch requests
- PanDA was not yet ready for dynamic provisioning
- Studying AutoPyFactory (APF) and Cloud Scheduler, whose core loop is sketched below
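The core of any of these provisioners is the same control loop: match the number of cloud workers to the batch queue depth. A deliberately simplified sketch with hypothetical helper stubs (not APF or Cloud Scheduler code):

```python
# Simplified elastic-provisioning loop. count_idle_jobs(), boot_worker()
# and retire_worker() are hypothetical stand-ins for a condor_q query
# and Nova/EC2 API calls.
import time

JOBS_PER_WORKER = 8        # slots per VM, illustrative
MAX_WORKERS = 100

def count_idle_jobs():
    """Stand-in for querying the batch system (e.g. parsing condor_q)."""
    return 0

def boot_worker():
    """Stand-in for an API call such as nova.servers.create(...)."""

def retire_worker():
    """Stand-in for draining and deleting an idle VM."""

def scale(workers, idle_jobs):
    """How many workers to add (positive) or retire (negative)."""
    wanted = min(MAX_WORKERS, -(-idle_jobs // JOBS_PER_WORKER))  # ceil division
    return wanted - workers

def provisioning_loop():
    workers = 0
    while True:
        delta = scale(workers, count_idle_jobs())
        for _ in range(max(delta, 0)):
            boot_worker()
        for _ in range(max(-delta, 0)):
            retire_worker()
        workers += delta
        time.sleep(300)    # re-evaluate every five minutes
```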

Conclusions
- The infrastructure can be used successfully for experiment workloads
- Scalability tests passed

Future work
- Unification of the image lifecycle
- Federation with other OpenStack clouds
- Accounting tools for clouds
- Better understanding of failures

Questions?