Provisioning 160,000 cores with HEPCloud at SC17


Provisioning 160,000 cores with HEPCloud at SC17
Phil DeMar (FNAL)
LHCOPN/LHCONE Meeting, April 04, 2017

Challenge: Doubling CMS computing using Google Compute Engine
- Live demo during Supercomputing 2016: four days, 12 hours a day
- Expand the Fermilab facility by an additional 160,000 cores
- Production computing, not a toy workload
- Use HEPCloud technology to do this as transparently as possible to the application
(B. Holtzman; FNAL)

HEPCloud
- A portal to an ecosystem of diverse computing resources, commercial or academic
- Provides "complete solutions" to users, with agreed-upon levels of service
- Routes to local or remote resources based on workflow requirements, cost, and efficiency of accessing the various resources
- Manages allocations of users to supercomputing facilities (e.g. LCFs, NERSC)
- Pilot project to explore the feasibility and capabilities of HEPCloud
- Collaborative effort with industry and academia
- Previously evaluated with AWS for NOvA computing
- Goal of moving into production by September 2018
(B. Holtzman; FNAL)

HEPCloud Architecture (diagram)
Components: workflow (the resource provisioning trigger), facility interface, authentication and authorization, decision engine, facility pool, and provisioner; monitoring takes input from all components. Provisioned resource targets: cloud, grid, local, and HPC resources.
(B. Holtzman; FNAL)

Pre-staging input data to Google Cloud Storage
- 500 TB of input data staged over a 10 Gb/s link into a multi-regional Cloud Storage bucket (object store), read by Google Compute Engine VMs
- Experiment-specific data placement service ("PhEDEx") tracks datasets and schedules transfers
- File Transfer Service supports an S3-compatibility mode (gfal-copy, davix)
- Google Cloud Storage mounted into preemptible VMs using gcsfuse via startup scripts
- Google-to-ESnet peering (via Equinix at SL) upgraded to 100 Gb/s capacity (but staging used 10 Gb/s)
- Converted the multi-regional bucket to a regional bucket overnight, resulting in 30% less cost
(B. Holtzman; FNAL)

Cores from Google (B. Holtzman; FNAL)

(B. Holtzman; FNAL)

Tale of the tape
- 6.35 M wall-hours used; 5.42 M wall-hours for completed jobs
- 730,172 simulation jobs submitted; only 47 did not complete
- Most wasted hours came during ramp-up as we found and eliminated issues; goodput was 94% during the last 3 days
- Costs on Google Cloud during Supercomputing 2016:
  - $71k virtual machines
  - $8.6k network egress
  - $8.5k magnetic persistent disk (attached to VMs)
  - $3.5k cloud storage for input data
- 205 M physics events generated, yielding 81.8 TB of data
- Cost: ~1.6 cents per core-hour (on-premises: 0.9 cents per core-hour assuming 100% utilization)
(B. Holtzman; FNAL)
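As a sanity check, the quoted cost-per-core-hour follows from the line items above; the exact figure depends on whether all wall-hours or only completed-job wall-hours are divided into the total (a back-of-the-envelope calculation, not from the slides):

```python
# Back-of-the-envelope check of the SC16 cost figures quoted above.
vm = 71_000        # virtual machine cost ($)
egress = 8_600     # network egress ($)
disk = 8_500       # magnetic persistent disk ($)
storage = 3_500    # cloud storage for input data ($)

total = vm + egress + disk + storage       # $91,600 total spend

all_wallhours = 6.35e6                     # all wall-hours used
good_wallhours = 5.42e6                    # wall-hours of completed jobs

cents_all = 100 * total / all_wallhours    # ~1.44 cents per core-hour
cents_good = 100 * total / good_wallhours  # ~1.69 cents per core-hour

print(f"total ${total:,}; {cents_all:.2f}-{cents_good:.2f} cents/core-hour")
```

The two rates bracket the ~1.6 cents per core-hour quoted on the slide.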

Additional/Backup Slides

Provisioning remote resources via glideinWMS
(Diagram: condor_submit, HTCondor schedulers, and the HTCondor central manager on the submit side; the VO frontend pulls jobs; the GlideinWMS factory submits via HTCondor-G to a cloud provider; each virtual machine runs a glidein whose HTCondor startd executes the job.)
- GlideinWMS submits "pilot jobs" to compute resources based on demand
- Pilot jobs execute on the resource and fetch user jobs from a queue
- Pilot jobs hide the heterogeneity of the compute resources from the user and validate the environment (they will not start user jobs on bad resources)
(B. Holtzman; FNAL)
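The pilot model described above can be sketched in a few lines of Python. This is a toy illustration of the pattern, not glideinWMS code; the queue, the health check, and the job payloads are all invented for the example:

```python
from queue import Queue, Empty

def validate_environment(resource):
    """A pilot first checks the node it landed on; a real glidein
    probes the OS, disk, and software stack (toy check here)."""
    return resource.get("healthy", False)

def run_pilot(resource, job_queue):
    """Pilot job: validate the resource, then pull and run user jobs.
    Starts no user jobs on a bad resource."""
    if not validate_environment(resource):
        return []                         # bad node: fetch nothing
    results = []
    while True:
        try:
            job = job_queue.get_nowait()  # late binding: jobs are
        except Empty:                     # matched to nodes at run time
            break
        results.append(job())             # execute the user job
    return results

jobs = Queue()
for n in range(3):
    jobs.put(lambda n=n: n * n)

print(run_pilot({"healthy": True}, jobs))   # → [0, 1, 4]
```

The key property is that the user's jobs never see the remote batch system or cloud API; they only ever run inside a validated pilot.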

HTCondor: speaking Cloud APIs
- The HTCondor provisioner talks to Google Compute Engine through the Cloud APIs
- The provisioner was initially written by the HTCondor team at UW-Madison
- Google contributed to the open-source HTCondor project:
  - Added support for preemptible VMs and service accounts
  - Fixed a critical bug to address scaling
(B. Holtzman; FNAL)
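The scaling bug mentioned in the next-to-last slide of the backup section is a classic pagination pitfall: list-style cloud APIs return results one page at a time along with a continuation token, and a client that ignores the token silently sees only the first page. A minimal sketch of the pattern against a mock API (the page size, token names, and VM names are illustrative, not the real Google client library):

```python
# Mock of a paginated list API: each call returns one page of items
# plus a nextPageToken, similar in shape to a GCE instance listing.
PAGES = {
    None: {"items": [f"vm-{i}" for i in range(500)], "nextPageToken": "p2"},
    "p2": {"items": [f"vm-{i}" for i in range(500, 800)]},   # last page
}

def list_instances(page_token=None):
    return PAGES[page_token]

def list_all_instances():
    """Follow nextPageToken until the API stops returning one."""
    instances, token = [], None
    while True:
        page = list_instances(token)
        instances.extend(page["items"])
        token = page.get("nextPageToken")
        if token is None:                 # no more pages
            return instances

buggy = list_instances()["items"]   # ignores the token: sees 500 VMs
fixed = list_all_instances()        # follows tokens: sees all 800 VMs
print(len(buggy), len(fixed))       # → 500 800
```

This shape of bug only surfaces once the fleet exceeds one page of results, which matches the observation that it triggered only above 500 VMs.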

Providing application software in a distributed world
- CernVM-FS delivers the entire software stack from Stratum 1 web servers through an HTTP content distribution network
- Worker nodes mount the filesystem via FUSE in the OS kernel
- Cache hierarchy: worker-node memory buffer (megabytes) → worker-node disk cache (gigabytes) → CDN/Stratum 1 (terabytes)
(B. Holtzman; FNAL)
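The three cache tiers can be sketched as a simple lookup chain. This is a toy model of the read path, not CernVM-FS code; the path and payload are invented for the example:

```python
def cvmfs_read(path, memory, disk, cdn):
    """Toy tiered read: check the in-memory buffer (MBs), then the
    local disk cache (GBs), then fall back to the HTTP CDN (TBs),
    filling the faster tiers on the way back."""
    if path in memory:
        return memory[path], "memory"
    if path in disk:
        memory[path] = disk[path]      # promote into the memory buffer
        return disk[path], "disk"
    data = cdn[path]                   # fetch over the HTTP CDN
    disk[path] = data                  # fill the disk cache
    memory[path] = data                # and the memory buffer
    return data, "cdn"

memory, disk = {}, {}
cdn = {"/cvmfs/cms.cern.ch/setup.sh": b"export CMS=1"}

print(cvmfs_read("/cvmfs/cms.cern.ch/setup.sh", memory, disk, cdn)[1])  # → cdn
print(cvmfs_read("/cvmfs/cms.cern.ch/setup.sh", memory, disk, cdn)[1])  # → memory
```

The first access pays the network round trip; repeated accesses on the same node are served locally, which is what makes the model work for thousands of short-lived cloud VMs running the same software.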

Architecture inside a single zone
- Worker nodes: a managed instance group of preemptible Google Compute Engine VMs (multiple instances), provisioned through the Cloud APIs by the HTCondor provisioner on the HTCondor head node
- Input data read from Cloud Storage (object store)
- Standard (non-preemptible) VMs run Squid caches for application software and job-by-job calibration data, behind an internal load balancer (Cloud Load Balancing)
(B. Holtzman; FNAL)

Using 4 zones in us-central1
- The single-zone architecture is replicated in four zones: us-central1-a, us-central1-b, us-central1-c, and us-central1-f
- Each zone has its own managed instance group of preemptible Compute Engine VMs, Cloud API provisioning, internal load balancer, and standard-VM Squid caches
- All zones read the shared input data from Cloud Storage
(B. Holtzman; FNAL)

Some lessons learned at scale
- The standard VM type (3.75 GB RAM) had more memory than the applications need; a custom machine type with 2 GB gave 20% cost savings
- A bug in the HTCondor provisioning code ignored the pagination API, so it was only triggered above 500 VMs; patch provided by Google
- Expanded the subnet from 4,096 to 16,384 IPs (gcloud compute networks subnets expand-ip-range), but a firewall rule on the Squid caches was still scoped to the old range: allow-internal-squid 10.128.0.0/20 tcp:3128
(B. Holtzman; FNAL)
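The firewall gotcha above can be checked with the standard-library ipaddress module: a /20 holds 4,096 addresses, so after expanding the subnet to a /18 (16,384 addresses), three quarters of the possible VM addresses fall outside a rule still scoped to 10.128.0.0/20. The specific VM addresses below are illustrative:

```python
import ipaddress

old = ipaddress.ip_network("10.128.0.0/20")  # original subnet, 4096 IPs
new = ipaddress.ip_network("10.128.0.0/18")  # after expand-ip-range, 16384 IPs
rule = old                                   # firewall rule still allows only the /20

print(old.num_addresses, new.num_addresses)  # → 4096 16384

vm_in_old = ipaddress.ip_address("10.128.15.200")  # inside the old range
vm_in_new = ipaddress.ip_address("10.128.37.10")   # only in the expanded range

print(vm_in_old in rule)   # → True  (can reach the Squid caches)
print(vm_in_new in rule)   # → False (blocked on tcp:3128)
```

Any VM unlucky enough to get an address above the old /20 boundary could not reach the Squid caches until the firewall rule was widened to match the new subnet.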

Next steps
- HEPCloud moves into production in September 2018
- The decision engine (when and how much to provision) is in R&D
- Supercomputers at Department of Energy facilities: already provisioning cycles on Edison and Cori at NERSC
- Additional commercial cloud providers: done: Google Cloud Platform, Amazon Web Services; next: Microsoft Azure, ?
- Non-pleasingly-parallel problems: deep learning, new architectures
(B. Holtzman; FNAL)