Commissioning the CERN IT Agile Infrastructure with experiment workloads Ramón Medrano Llamas IT-SDC-OL
Agenda The Agile Infrastructure Workload Management Systems Dynamic provisioning Conclusions Commissioning the CERN IT Agile Infrastructure with experiment workloads 2
The Agile Infrastructure Private IaaS cloud OpenStack based Federates Meyrin and Wigner 15,000 hypervisors by 2015 300,000 VMs by 2015 Configuration management tools Puppet, Foreman Commissioning the CERN IT Agile Infrastructure with experiment workloads 3
Workload Management Systems Pilot based systems ATLAS: PanDA CMS: glidein WMS Both have an HTCondor backend Using Nova and EC2 APIs Commissioning the CERN IT Agile Infrastructure with experiment workloads 4
PanDA integration Manual HTCondor cluster deployment Long lived worker nodes Condor for pilot submission CVMFS + EOS Same setup on HLT, Helix Nebula, Rackspace… Commissioning the CERN IT Agile Infrastructure with experiment workloads 5
glidein integration Dynamic cluster deployment (EC2) Worker node automatically managed Condor for batch orchestration CVMFS + EOS Commissioning the CERN IT Agile Infrastructure with experiment workloads 6
Deployment Commissioning the CERN IT Agile Infrastructure with experiment workloads 7
Support from the AI Team Got 1,600 cores from the OpenStack team Testing Essex, Folsom, Grizzly Complete freedom to access resources Consultancy at any time Rapid bug report-solution cycle Commissioning the CERN IT Agile Infrastructure with experiment workloads 8
Testing strategy Standard HammerCloud benchmark Compared with other clouds, bare metal 690,000 ATLAS jobs 337,000 CMS jobs Commissioning the CERN IT Agile Infrastructure with experiment workloads 9
Testing summary Commissioning the CERN IT Agile Infrastructure with experiment workloads 10 ATLAS CMS SchedulerPanDA SchedulerglideinWMS Cluster managementStatic Cluster managementDynamic Cluster size770 cores Cluster size200 cores Jobs submitted694,698 Jobs submitted337,080 Failure rate9.95% Failure rate0.31% Job typeSimulation Job typeSimulation Typical job duration31 min. Typical job duration9 min. Duration variance17.8 min. Duration variance4.8 min. Most common errorFailed to read LFC Most common errorApp. Error 8020
Performance testing: ATLAS, Ibex SiteWallclock (s) CPU efficiency(%) Failure rate (%) OPENSTACK_CLOUD3, BNL_CLOUD1, IAAS1, CERN-PROD1, BNL_CVMFS_11, Commissioning the CERN IT Agile Infrastructure with experiment workloads 11 Tested Late 2012 Over commission of resources CPU efficiency not reliable in this context
Performance testing: ATLAS, Grizzly SiteWallclock (s) CPU efficiency(%) Failure rate (%) OPENSTACK_CLOUD1, BNL_CLOUD1, IAAS1, CERN-PROD1, BNL_CVMFS_11, Commissioning the CERN IT Agile Infrastructure with experiment workloads 12 Tested Late 2013 Good improvement in performance And predictability
Performance testing: CMS, Ibex SiteWallclock (s) CPU efficiency(%) Failure rate (%) T2_CH_CERN_AI T2_CH_CERN T1_US_FNAL T1_DE_KIT Commissioning the CERN IT Agile Infrastructure with experiment workloads 13 Tested Mid 2013 Reliability is incredible good
Performance testing: ATLAS wallclock Commissioning the CERN IT Agile Infrastructure with experiment workloads 14
Reliability testing Few failures Infrastructure vs. WMS failure Need new monitoring techniques Difficult to measure with state of the art tools Commissioning the CERN IT Agile Infrastructure with experiment workloads 15
Reliability testing Commissioning the CERN IT Agile Infrastructure with experiment workloads 16
Reliability testing Commissioning the CERN IT Agile Infrastructure with experiment workloads 17
Reliability testing Commissioning the CERN IT Agile Infrastructure with experiment workloads 18
Dynamic provisioning gLidein scales clusters automatically, Slowness with non-batch requests PanDA was still not ready for it Studying APF and Cloud Scheduler Commissioning the CERN IT Agile Infrastructure with experiment workloads 19
Conclusions Being able to successfully use the infrastructure Scalability tests passed Commissioning the CERN IT Agile Infrastructure with experiment workloads 20
Future work Unification of image lifetime Federation of other OpenStack clouds Accounting tools for clouds Better understanding of failures Commissioning the CERN IT Agile Infrastructure with experiment workloads 21
Questions? Commissioning the CERN IT Agile Infrastructure with experiment workloads 22