
Slide 1: Virtualized/Cloud-based Facility Resources
John Hover
US ATLAS Distributed Facility Workshop, UC Santa Cruz, 13 Nov 2012

Slide 2: Outline
– Rationale/Benefits Recap
– Dependencies/Limitations
– Current BNL Status
  – OpenStack
  – VMs with Boxgrinder
  – Panda queue
  – Local batch virtualization support
– Next Steps and Plans
– Discussion

Slide 3: Overall Rationale Recap
Why Cloud vs. the current Grid?
– Common interface for end-user virtualization management, and therefore easy expansion to external cloud resources; the same workflow extends to:
  – Commercial and academic cloud resources.
  – Future OSG and DOE site cloud resources.
– Dynamic repartitioning of local resources, easier for site admins.
– Includes all benefits of non-Cloud virtualization: customized OS environments for reliable opportunistic usage.
– Flexible facility management: reboot host nodes without draining queues; move running VMs to other hosts.
– Flexible VO usage: rapid prototyping and testing of platforms for experiments.

Slide 4: Dependencies/Limitations
Using Cloud in US ATLAS facility contexts will require:
– An X.509 authentication mechanism. Current platform implementations all require username/password.
– An accounting mechanism.
– Automated, supported installation and configuration.
– All of these need to happen under the auspices of OSG. Cloud computing will be a topic at the next Blueprint. Add X.509 to OpenStack?
Intrusive: a fundamental change
– It does represent a new lowest-level resource management layer.
– But once adopted, all current management tools can still be used.
Networking and security
– Public IPs require some DNS delegation and may also require additional addresses. (No public IPs at BNL due to lack of delegation.)
– Some sites may have security issues with the Cloud model.

Slide 5: Dependencies/Limitations (cont.)
Inconsistent behavior, bugs, immature software:
– 'shutdown -h' destroys the instance on EC2, but only shuts it off on OpenStack (leaving the instance to count against quota).
– When starting large numbers of VMs, a few sometimes enter the ERROR state and require removal.
– Boxgrinder requires patches for mixed libraries and for SL5/EC2.
– No Boxgrinder image upload to OpenStack v4.0 (yet).
ATLAS infrastructure was not originally designed to be fully dynamic:
– E.g. fully programmatic Panda site creation and destruction.
– DDM assumes persistent endpoints.
– Others? Any element that isn't made to be created, managed, and cleanly deleted programmatically.

Slide 6: BNL OpenStack Cloud
OpenStack 4.0 (Essex)
– 1 controller, 100 execute hosts (~300 2 GB VMs), fairly recent hardware (3 years old), KVM virtualization with hardware support.
– Per-rack network partitioning (10 Gb throughput shared).
– Provides EC2 (nova), S3 (swift), and an image service (glance).
– Essex adds the keystone identity/auth service and the Dashboard.
– Programmatically deployed, with configs publicly available.
– Fully automated compute-node installation/setup (Puppet).
– Enables 'tenants': partitions VMs into separate authentication groups, so that users cannot terminate (or see) each other's VMs. Three projects currently. (See the sketch following this slide.)
– Winning the platform war: CERN is switching to OpenStack. BNL sent 2 people to the OpenStack conference, which CERN also attended.
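
As an illustration of the tenant partitioning above, creating a new project and user with the Essex-era keystone command-line client looks roughly like the sketch below. The tenant name, user name, password, and email address are placeholders rather than the actual BNL projects, and admin credentials are assumed to be already loaded in the environment (e.g. via a sourced novarc/keystonerc).

# Create an isolated tenant; its members cannot see or terminate other tenants' VMs.
TENANT_ID=$(keystone tenant-create --name atlas-group1 \
    --description "Example ATLAS cloud project" | awk '/ id / {print $4}')

# Create a user belonging to that tenant (name and password are placeholders).
USER_ID=$(keystone user-create --name clouduser --tenant-id "$TENANT_ID" \
    --pass changeme --email clouduser@example.org | awk '/ id / {print $4}')

# Grant the standard Member role so the user can launch and terminate
# instances only within this tenant.
ROLE_ID=$(keystone role-list | awk '/ Member / {print $2}')
keystone user-role-add --user-id "$USER_ID" --role-id "$ROLE_ID" --tenant-id "$TENANT_ID"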

Slide 7: BNL OpenStack Layout
(Layout diagram; not reproduced in the transcript.)

Slide 8: BNL OpenStack Cloud (cont.)
Ancillary supporting resources:
– SVN repositories for configuration and development, e.g. Boxgrinder appliance definitions.
  – See:
– YUM repositories: snapshots (OSG, EPEL dependencies, etc.), mirrors, grid dev, test, release.
  – See:

Slide 9: BNL VM Authoring
Programmatic ATLAS worker-node VM creation using Boxgrinder.
Notable features:
– Modular appliance inheritance: the wn-atlas definition inherits the wn-osg profile, which in turn inherits from base (see the sketch following this slide).
– Connects back to a static Condor schedd for jobs.
– Boxgrinder creates images dynamically for KVM/libvirt, EC2, VirtualBox, and VMware via 'platform plugins'.
– Boxgrinder can upload built images automatically to OpenStack (v3), EC2, libvirt, or a local directory via 'delivery plugins'.
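
To illustrate the appliance inheritance, a deliberately simplified, hypothetical appliance definition might look like the sketch below. The parent appliance name and package list are placeholders, not the actual wn-osg/wn-atlas definitions, which pull in considerably more configuration.

# Hypothetical, simplified appliance definition; all names are placeholders.
cat > wn-atlas-example.appl <<'EOF'
name: wn-atlas-example
summary: Example ATLAS worker-node appliance
os:
  name: sl
  version: 5
hardware:
  memory: 2048
appliances:
  - wn-osg-example        # parent appliance; its contents are inherited
packages:
  - condor                # startd that connects back to the static schedd
EOF

# Build the image; a platform plugin (-p) or delivery plugin (-d) option can be
# added to target EC2, VirtualBox, VMware, etc., as described above.
boxgrinder-build wn-atlas-example.appl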

Slide 10: Simple WN Recipe
Build and upload:

svn co
boxgrinder-build -f boxgrinder/sl5-x86_64-wn-atlas-bnlcloud.appl
. ~/nova-essex2/novarc
glance add name=sl5-atlas-wn-mysite is_public=true disk_format=raw container_format=bare \
  --host=cldext03.usatlas.bnl.gov --port=9292 \
  < build/appliances/x86_64/sl/5/sl5-x86_64-wn-atlas-bnlcloud/3.0/sl-plugin/sl5-x86_64-wn-atlas-bnlcloud-sda.raw

Describe and run images:

euca-describe-images --config=~/nova-essex2/novarc
euca-run-instances --config ~/nova-essex2/novarc -n 5 ami
euca-describe-instances --config ~/nova-essex2/novarc
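
When a test batch is no longer needed, the matching euca2ools cleanup call would look roughly like the following (the instance IDs here are placeholders):

euca-terminate-instances --config ~/nova-essex2/novarc i-00000001 i-00000002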

Slide 11: Panda
BNL_CLOUD:
– Standard production Panda site.
– Configured to use wide-area stage-in/out (SRM, LFC), so the same cluster can be extended transparently to Amazon or other public and academic clouds.
– Steadily running ~200 production jobs on auto-built VMs for months, with 0% job failure.
– HammerCloud tests and auto-exclude enabled; no problems so far.
– Performance actually better than the main BNL production site (see the following slide).
– Also ran a hybrid OpenStack/EC2 cluster for a week. No problems.

Slide 12: Performance
Comparison of BNL_CLOUD and BNL_CVMFS_1 (plots not reproduced in the transcript).
Result: BNL_CLOUD is actually faster (1740 s vs. 1960 s).
– HammerCloud test (ATLASG4_trf_7.2...).
– Setup time (no AFS)? No shared filesystem?
– Similar spread for other tests (e.g., PFT Evgen).
We would welcome any HammerCloud/Panda experts who would like to gather stats.

Slide 13: Local RACF Batch Virtualization
The RACF Linux Farm group provides a VM job type:
– Provided by a custom libvirt wrapper (not Condor-VM); an illustrative libvirt sketch follows this slide.
– The end user specifies what they want in the job definition.
– The end user cannot provide a VM image.
– Available on 10% of the cluster; not intended for large scale.
– Used to run specific experiment (STAR/PHENIX) work on SL4.
Plans: hardware is now in hand for a larger-scale prototype of Condor-VM-based full facility virtualization.
Challenges: I/O concerns, NFS security issues (privileged port), IP address space.
(Info provided by Alexander Zaytsev)
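
The RACF wrapper itself is internal, but purely as an illustration of what driving KVM through libvirt involves, a transient guest can be defined and launched roughly as in the sketch below (the domain name, memory size, and disk path are placeholders).

# Hypothetical libvirt domain definition; all values are placeholders.
cat > example-vm.xml <<'EOF'
<domain type='kvm'>
  <name>example-vm</name>
  <memory>2097152</memory>  <!-- 2 GB, specified in KiB -->
  <vcpu>1</vcpu>
  <os>
    <type arch='x86_64'>hvm</type>
    <boot dev='hd'/>
  </os>
  <devices>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/var/lib/libvirt/images/example-vm.raw'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <interface type='network'>
      <source network='default'/>
    </interface>
  </devices>
</domain>
EOF

# Start a transient guest from the definition, and tear it down when the job ends.
virsh create example-vm.xml
virsh destroy example-vm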

Slide 14: Plans 1
Cross-group Puppet group
– Initiated at the last S&C week. Includes CERN, ATLAS, and US ATLAS people.
– Built around a shared config repo set up at https://svn.usatlas.bnl.gov/svn/atlas-puppet
Refine setup and management
– Generalize Puppet classes for node setup for use by other sites.
– Finish automating controller setup via Puppet.
– Facilitate wider use of the BNL instance by other US ATLAS groups.

Slide 15: Plans 2
Decide on final UserData/runtime contextualization (a sketch follows this slide):
– Will allow the Tier 2/3 Cloud group to use our base images for their work.
– Completely parameterize our current VM.
APF development
– Full support for the VM lifecycle, with flexible algorithms to decide when to terminate running VMs.
– Cascading hierarchies of Cloud targets: will allow programmatic scheduling of jobs on site-local, other private, and commercial clouds, based on job priority and cloud cost.
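
As one possible illustration of UserData contextualization (a sketch only, not a decided scheme; the variable names, hostname, and image ID are placeholders), a small user-data file can be passed at launch and read by the VM's boot scripts to point its startd at the right schedd and Panda queue:

# Hypothetical contextualization data read by the VM at boot.
cat > userdata.txt <<'EOF'
CONDOR_HOST=static-schedd.example.org
PANDA_QUEUE=BNL_CLOUD
EOF

# Launch worker-node VMs with that user data (the image ID is a placeholder).
euca-run-instances --config ~/nova-essex2/novarc -n 5 --user-data-file userdata.txt ami-00000001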

Slide 16: Plans 3
Expand usage and scale
– Scale up to a size comparable to the main Grid site.
– Dynamically re-partitionable between groups.
– Option to transparently expand to other private clouds and/or commercial clouds.
– Run at large scale.
Fully documented dynamic fabric and VM configuration
– To allow replication/customization of both site fabric deployment and job workflow by other ATLAS sites, and by other OSG VOs.
– To allow specialized usage of our Cloud by users.

Slide 17: Questions?

Slide 18: Discussion

Slide 20: Extra Slides

Slide 22: Elastic Prod Cluster: Components
Static Condor schedd
– Standalone, used only for Cloud work.
Two AutoPyFactory (APF) queues
– One observes the local Condor schedd: when jobs are Idle, it submits WN VMs to the IaaS (up to some limit); when WNs are Unclaimed, it shuts them down.
– The other observes a Panda queue: when jobs are activated, it submits pilots to the local cluster Condor queue.
Worker-node VMs
– Generic Condor startds that connect back to the local Condor cluster (see the sketch following this slide). All VMs are identical, don't need public IPs, and don't need to know about each other.
Panda site
– Associated with the BNL SE, LFC, and CVMFS-based releases.
– But no site-internal configuration (NFS, file transfer, etc.).
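
A minimal sketch of the Condor configuration such a worker-node VM might carry at boot; the hostname and policy values are placeholders, not the actual BNL settings.

# Hypothetical condor_config.local baked into the worker-node image.
cat > /etc/condor/condor_config.local <<'EOF'
# Point at the standalone schedd/collector used for Cloud work.
CONDOR_HOST = static-schedd.example.org
# Execute-only node: run the master and startd, no local schedd.
DAEMON_LIST = MASTER, STARTD
# Always willing to run jobs; the VM exists only for this purpose.
START = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
# VMs have no public IPs, so route connections back through the collector (CCB).
CCB_ADDRESS = $(CONDOR_HOST)
EOF
service condor restart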

Slide 23: Programmatic Repeatability, Extensibility
The key feature of our work has been to make all our processes and configs general and public, so that others can use them.
Except for pilot submission (AutoPyFactory), we have used only standard, widely used technology (RHEL/SL, Condor, Boxgrinder, OpenStack).
– Our Boxgrinder definitions are published.
– All source repositories are public and usable over the internet, e.g.:
  – Snapshots of external repositories, for consistent builds.
  – Custom repo.
– Our OpenStack host-configuration Puppet manifests are published and will be made generic enough to be borrowed.
– Our VM contextualization process will be documented and usable by other OSG VOs and ATLAS groups.