Operating a distributed IaaS Cloud. Ashok Agarwal, Patrick Armstrong, Adam Bishop, Andre Charbonneau, Ronald Desmarais, Kyle Fransham, Roger Impey, Colin Leavett-Brown, Michael Paterson, Wayne Podaima, Randall Sobie, Matt Vliet. University of Victoria, Victoria, Canada; National Research Council of Canada, Ottawa. Presented by Ian Gable. HEPiX Spring 2011, GSI.

Outline Motivation –HEP Legacy Data Project –CANFAR: Observational Astronomy System Architecture Operational Experience Future work Summary Ian Gable2

Motivation Projects requiring modest resources that we believe are well suited to Infrastructure-as-a-Service (IaaS) clouds: –The High Energy Physics Legacy Data project –The Canadian Advanced Network for Astronomical Research (CANFAR) We expect an increasing number of IaaS clouds to be available for research computing. Ian Gable3

HEP Legacy Data Project We have been funded in Canada to investigate a possible solution for analyzing BaBar data for the next 5-10 years. Collaborating with SLAC, who are also pursuing this goal. We are exploiting VMs and IaaS clouds. We assume we will be able to run BaBar code in a VM for the next 5-10 years. We hope that the results will be applicable to other experiments. 2.5 FTEs for 2 years, ending in October. Ian Gable4

9.5 million lines of C++ and Fortran. Compiled size is 30 GB. A significant amount of manpower is required to maintain the software. Each installation must be validated before generated results will be accepted. Moving between SL 4 and SL 5 required a significant amount of work, and SL 5 is likely the last version of SL that will be supported. Ian Gable5

CANFAR is a partnership between –University of Victoria –University of British Columbia –National Research Council Canadian Astronomy Data Centre –Herzberg Institute for Astrophysics Will provide computing infrastructure for 6 observational astronomy survey projects Ian Gable6

Jobs are embarrassingly parallel, much like HEP. Each of these surveys needs a different processing environment, which requires: –A specific version of a Linux distribution –A specific compiler version –Specific libraries The applications have little documentation These environments are evolving rapidly Ian Gable7

How do we manage jobs on IaaS? With IaaS, we can easily create many instances of a VM image. How do we manage the VMs once booted? How do we get jobs to the VMs? Ian Gable8

Possible solutions The Nimbus Context Broker allows users to create “One Click Clusters” –Users create a cluster with their VM, run their jobs, then shut it down –However, most users are used to sending jobs to an HTC cluster, then waiting for those jobs to complete –Cluster management is unfamiliar to them –Already used for a big run with STAR in 2009 Univa Grid Engine submission to Amazon EC2 –Release 6.2 Update 5 can work with EC2 –Only supports Amazon This area is evolving very rapidly! Other solutions? Ian Gable9

Our Solution: Condor + Cloud Scheduler Users create a VM with their experiment software installed –A basic VM is created by our group, and users add their analysis or processing software to create their custom VM Users then create batch jobs as they would on a regular cluster, but they specify which VM image should run their jobs Aside from the VM creation step, this is very similar to the HTC workflow Ian Gable10

Step 1: Research and commercial clouds are made available with some cloud-like interface. Ian Gable11

Step 2: The user submits jobs to a Condor job scheduler that has no resources attached to it. Ian Gable12

Step 3: Cloud Scheduler detects that there are waiting jobs in the Condor queues and then makes a request to boot VMs that match the job requirements. Ian Gable13

Step 4: The VMs boot, attach themselves to the Condor queues and begin draining jobs. Once no more jobs require the VMs, Cloud Scheduler shuts them down. Ian Gable14

How does it work?
1. A user submits a job to a job scheduler
2. This job sits idle in the queue, because there are no resources yet
3. Cloud Scheduler examines the queue, and determines that there are jobs without resources
4. Cloud Scheduler starts VMs on IaaS clusters
5. These VMs advertise themselves to the job scheduler
6. The job scheduler sees these VMs, and starts running jobs on them
7. Once all of the jobs are done, Cloud Scheduler shuts down the VMs
Ian Gable15
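
A minimal sketch of this cycle, assuming toy Job and Cloud stand-ins. The names and data structures here are illustrative only; the real logic lives in cloud_management.py and job_management.py, listed near the end of this talk.

"""Toy sketch of the Cloud Scheduler polling cycle (illustrative, not the real code)."""
from dataclasses import dataclass, field

@dataclass
class Job:
    vm_type: str           # which VM image the job asks for (+VMType in the Condor job)
    running: bool = False  # set once a matching VM has picked the job up

@dataclass
class Cloud:
    name: str
    free_slots: int
    booted: list = field(default_factory=list)

    def boot_vm(self, vm_type):
        self.free_slots -= 1
        self.booted.append(vm_type)

def scheduling_cycle(queue, clouds):
    """One pass: boot VMs for idle jobs, retire VMs that nothing needs."""
    # Steps 2-3: jobs sitting idle because no matching resource exists yet.
    idle = [j for j in queue if not j.running]

    # Step 4: boot a matching VM on the first cloud with a free slot.
    for job in idle:
        cloud = next((c for c in clouds if c.free_slots > 0), None)
        if cloud:
            cloud.boot_vm(job.vm_type)

    # Steps 5-6 happen on the Condor side: the booted VMs advertise
    # themselves to the scheduler and the jobs start running on them.

    # Step 7: shut down VMs whose type no queued job requires any more.
    needed = {j.vm_type for j in queue}
    for cloud in clouds:
        for vm_type in list(cloud.booted):
            if vm_type not in needed:
                cloud.booted.remove(vm_type)
                cloud.free_slots += 1

# Example: two idle "redshift" jobs, one cloud with three free slots.
queue = [Job("redshift"), Job("redshift")]
clouds = [Cloud("nimbus-uvic", 3)]
scheduling_cycle(queue, clouds)
print(clouds[0].booted)  # ['redshift', 'redshift']

In the real system this pass repeats on a timer, and the "still needed" test also has to account for jobs already running on a VM.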

Implementation Details We use Condor as our job scheduler –Good at handling heterogeneous and dynamic resources –We were already familiar with it –Already known to be scalable We use the Condor Connection Broker (CCB) to get around clouds with private IP addresses We primarily support Nimbus and Amazon EC2, with experimental support for OpenNebula and Eucalyptus. Ian Gable16
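
As one illustration, a stand-in for the "examine the Condor queue" step could simply shell out to condor_q. The real Cloud Scheduler talks to Condor through its own interfaces; this is just one way to list the idle jobs, and it assumes the HTCondor command-line tools are installed.

"""Sketch: list idle Condor jobs (JobStatus == 1) as cluster.proc ids."""
import subprocess

def idle_job_ids():
    result = subprocess.run(
        ["condor_q", "-constraint", "JobStatus == 1",
         "-format", "%d.", "ClusterId", "-format", "%d\n", "ProcId"],
        capture_output=True, text=True, check=True)
    return result.stdout.split()

print(idle_job_ids())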

Implementation Details Cont. Each VM has the Condor startd daemon installed, which advertises itself to the central manager at start-up We use a Condor Rank expression to ensure that jobs only end up on the VMs they are intended for Users use Condor attributes to specify the number of CPUs, memory, and scratch space that should be on their VMs We have a rudimentary round-robin fairness scheme to ensure that users receive a roughly equal share of resources; it respects Condor priorities (a toy sketch of the round-robin pass follows) Ian Gable17
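
A toy version of such a round-robin pass, with illustrative names only and no attempt to model Condor priorities; it is not the actual Cloud Scheduler implementation.

"""Hand out free VM slots one at a time across users with queued jobs."""
from collections import deque

def round_robin_allocate(queued_jobs_by_user, free_slots):
    allocation = {user: 0 for user in queued_jobs_by_user}
    waiting = deque(u for u, n in queued_jobs_by_user.items() if n > 0)
    while free_slots > 0 and waiting:
        user = waiting.popleft()
        allocation[user] += 1
        free_slots -= 1
        if allocation[user] < queued_jobs_by_user[user]:
            waiting.append(user)  # user still has queued jobs: back of the line
    return allocation

# Example: 5 free slots shared between two heavy users and one light user.
print(round_robin_allocate({"alice": 10, "bob": 10, "carol": 1}, 5))
# -> {'alice': 2, 'bob': 2, 'carol': 1}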

Condor Job Description File Ian Gable18

Universe = vanilla
Executable = red.sh
Arguments = W3-3+3
Log = red10.log
Output = red10.out
Error = red10.error
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
# Run-environment requirements
Requirements = VMType =?= "redshift"
+VMNetwork = "private"
+VMCPUArch = "x86"
+VMLoc = "
+VMMem = "2048"
+VMCPUCores = "1"
+VMStorage = "20"
+VMAMI = "ami-fdee0094"
Queue

CANFAR: MAssive Compact Halo Objects Detailed re-analysis of data from the MACHO experiment dark matter search. Jobs perform a wget to retrieve the input data (40 M) and have a 4-6 hour run time. Low I/O, great for clouds. Astronomers are happy with the environment. Ian Gable19

Experience with CANFAR Ian Gable20 (time-series plot; annotations mark a period of shorter jobs and a cluster outage)

VM Run Times (CANFAR) Ian Gable21 (distribution of VM run times, n = 32572; the maximum allowed VM run time is 1 week)

VM Boots (CANFAR) Ian Gable22

Experimental BaBar Cloud Resources Ian Gable23
Resource            | Cores               | Notes
Lab                 | 100 Cores Allocated | Resource allocation to support BaBar
Elephant            |                     | Experimental cloud cluster hosts (xrootd for cloud)
NRC Cloud in Ottawa | 68 Cores            | Hosts VM image repository (repoman)
Amazon EC2          | Proportional to $   | Grant funding from Amazon
Hermes              | 280 max             | Occasional backfill access

Ian Gable24 BaBar Cloud Configuration

A Typical Week (Babar) Ian Gable25

BaBar MC production Ian Gable26

Other Examples Ian Gable27

Inside a Cloud VM Ian Gable28

Inside a Cloud VM Cont. Ian Gable29

A batch of User Analysis Jobs Ian Gable30
Sample     | Jobs | I/O rate
Tau1N-MC   | 642  | 20 KB/s
Tau1N-data |      | KB/s
Tau11-MC   |      | KB/s

Cloud I/O for BaBar User Analysis Ian Gable31
Sample     | Jobs | I/O rate
Tau1N-MC   | 642  | 20 KB/s
Tau1N-data |      | KB/s
Tau11-MC   |      | KB/s

Some Lessons Learned Monitoring cloud resources is difficult –Can you even expect the same kind of knowledge? Debugging user VM problems is hard for users, and hard for support –What do you do when the VM network doesn’t come up? No two EC2 API implementations are the same –Nimbus, OpenNebula, Eucalyptus are all different (see the sketch below) Users are nicely insulated from cloud failures: if a VM doesn’t come up, the job simply isn’t drained and stays queued. Ian Gable32
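
To give a flavour of what those API differences mean for a client, here is a hedged sketch against the boto 2.x API. The endpoints, ports and paths are illustrative examples only, not our production settings, and each implementation also differs in which calls it actually honours.

"""Sketch: every 'EC2-compatible' cloud needs its own connection special case."""
import boto
from boto.ec2.regioninfo import RegionInfo

def connect(cloud, access_key, secret_key):
    if cloud == "amazon":
        # Real EC2: boto's defaults just work.
        return boto.connect_ec2(access_key, secret_key)
    if cloud == "eucalyptus":
        # Eucalyptus typically serves EC2 on its own port and path, often without SSL.
        return boto.connect_ec2(
            access_key, secret_key, is_secure=False, port=8773,
            region=RegionInfo(name="eucalyptus", endpoint="euca.example.org"),
            path="/services/Eucalyptus")
    if cloud == "nimbus":
        # Nimbus differs again (its own port and path, SSL on); details vary by release.
        return boto.connect_ec2(
            access_key, secret_key, is_secure=True, port=8444,
            region=RegionInfo(name="nimbus", endpoint="nimbus.example.org"),
            path="/")
    raise ValueError("unknown cloud: %s" % cloud)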

SLAC activities Cloud in a Box: –LTDA Analysis Cloud –The idea is to build a secure cloud to run obsolete operating systems without compromising the base OS. –VMs are on a separate VLAN, and strict firewall rules are in place. –Users are managed through LDAP on an up-to-date system. –Uses Condor / Cloud Scheduler / Nimbus for IaaS. SLAC Team: Homer Neal, Tina Cartaro, Steffen Luitz, Len Moss, Booker Bense, Igor Gaponenko, Wiko Kroeger, Kyle Fransham Ian Gable33

SLAC LTDA Cluster Ian Gable34 (architecture diagram: roughly 60 BBR-LTDA-VM hosts, each a managed RHEL6 machine running RHEL5 guest VMs behind a firewall, with xrootd, a virtual bridge and 24 TB of local disk per host; BBR-LTDA-SRV nodes provide NTP, LDAP, DNS, DHCP, NFS for CVS, home and work areas and the VM repo, and mySQL; BBR-LTDA-LOGIN provides user login, the local resource manager and the IaaS client.) Note: built using very generic design patterns, so useful to others.

Future Work/Challenges Increasing the scale –I/O scalability needs to be proven. –Total number of VMs. Security? Leverage the work of the HEPiX virtualization working group. Booting large numbers of VMs quickly on research clouds. –Copy-on-write images (qcow, ZFS-backed storage)? See the sketch below. –BitTorrent distribution? –Amazon does it, so we can too. Ian Gable35
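
As an illustration of the copy-on-write idea: each VM would boot from a thin overlay instead of a full copy of the 30 GB base image. File names are made up, this assumes qemu-img is installed and a raw base image, and it is not something Cloud Scheduler does today.

"""Sketch: create a qcow2 copy-on-write overlay over a shared base image."""
import subprocess

BASE = "sl5-babar-base.img"        # large, validated base image (illustrative name)
OVERLAY = "worker-node-001.qcow2"  # small per-VM copy-on-write overlay

subprocess.run(
    ["qemu-img", "create", "-f", "qcow2", "-F", "raw", "-b", BASE, OVERLAY],
    check=True)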

About the code Relatively small python package, lots of cloud interaction examples Ian Gable36

Ian-Gables-MacBook-Pro:cloud-scheduler igable$ cat source_files | xargs wc -l
   0 ./cloudscheduler/__init__.py
   1 ./cloudscheduler/__version__.py
 998 ./cloudscheduler/cloud_management.py
1169 ./cloudscheduler/cluster_tools.py
 362 ./cloudscheduler/config.py
 277 ./cloudscheduler/info_server.py
1086 ./cloudscheduler/job_management.py
   0 ./cloudscheduler/monitoring/__init__.py
  63 ./cloudscheduler/monitoring/cloud_logger.py
 208 ./cloudscheduler/monitoring/get_clouds.py
 176 ./cloudscheduler/utilities.py
  13 ./scripts/ec2contexthelper/setup.py
  28 ./setup.py
  99 cloud_resources.conf
1046 cloud_scheduler
 324 cloud_scheduler.conf
 130 cloud_status
5980 total

Summary Modest I/O jobs can be easily handled on IaaS clouds Early experiences are promising More work to show scalability Lots of open questions Ian Gable37

More Information Ian Gable, cloudscheduler.org Code on GitHub: –Run as a proper open source project (project roadmap shown) Ian Gable38

Acknowledgements Ian Gable39

Start of extra slides Ian Gable40

CANFAR CANFAR needs to provide computing infrastructure for 6 astronomy survey projects: Ian Gable41
Survey                                             | Acronym | Lead | Telescope
Next Generation Virgo Cluster Survey               | NGVS    | UVic | CFHT
Pan-Andromeda Archaeological Survey                | PAndAS  | UBC  | CFHT
SCUBA-2 All Sky Survey                             | SASSy   | UBC  | JCMT
SCUBA-2 Cosmology Legacy Survey                    | CLS     | UBC  | JCMT
Shapes and Photometric Redshifts for Large Surveys | SPzLS   | UBC  | CFHT
Time Variable Sky                                  | TVS     | UVic | CFHT
CFHT: Canada France Hawaii Telescope. JCMT: James Clerk Maxwell Telescope.

Cloud Scheduler Goals Don’t replicate existing functionality. To be able to use existing IaaS and job scheduler software together, today. Users should be able to use the familiar HTC tools. Support VM creation on Nimbus, OpenNebula, Eucalyptus, and EC2, i.e. all the IaaS resource types people are likely to encounter. Adequate scheduling to be useful to our users. Simple architecture. Ian Gable42

We have been interested in virtualization for some time: encapsulation of applications, good for shared resources, and performs well as shown at HEPiX. We are interested in pursuing user-provided VMs on clouds. These are steps 4 and 5 as outlined in Tony Cass’ “Vision for Virtualization” talk at HEPiX NERSC. (Slide screenshots: HEPiX Fall 2007 St. Louis, from Ian Gable; HEPiX Fall 2009 NERSC, from Tony Cass.) Ian Gable43