
Condor + Cloud Scheduler
Ian Gable
Ashok Agarwal, Patrick Armstrong, Andre Charbonneau, Ryan Enge, Kyle Fransham, Colin Leavett-Brown, Michael Paterson, Randall Sobie, Ryan Taylor
University of Victoria, Victoria, Canada
National Research Council of Canada, Ottawa
ATLAS Software & Computing Workshop

Outline
Strategy and Goals
Condor + Cloud Scheduler overview
Proof of concept
Status
Summary

Strategy and Goals
Use Condor + Cloud Scheduler to execute PanDA pilots on IaaS VMs.
Use the CernVM batch node image (it already includes Condor).
Condor + Cloud Scheduler handles the creation of the virtual machines and the movement of the pilot credentials onto the VMs.
Make it work without any modifications to how PanDA works now.

Condor + Cloud Scheduler
Cloud Scheduler is a simple Python package created by UVic and NRC to boot VMs on IaaS clouds based on the state of Condor queues.
Users create a VM image with their experiment software installed:
–A basic VM is created by our group, and users add their analysis or processing software to create their custom VM.
–For supported experiments, CernVM is the best option.
Users then create Condor batch jobs as they would on a regular cluster, but they specify which VM image should run their jobs.
Aside from the VM creation step, this is very similar to the regular HTC workflow.

Implementation Details
Condor is used as the job scheduler:
–VMs are contextualized with the Condor pool address and a service certificate.
–Each VM has the Condor startd daemon installed, which advertises to the central manager at start-up.
–GSI host authentication is used when VMs join the pool.
–User credentials are delegated to the VMs after boot, at job submission.
–The Condor Connection Broker is used to get around clouds with private IP addresses.
Cloud Scheduler:
–User proxy certificates are used for authenticating with the IaaS service where possible (Nimbus); otherwise a secret API key is used (EC2 style).
–Can communicate with Condor via the SOAP interface (slow at scale) or via condor_q.
–Primarily supports Nimbus and Amazon EC2, with experimental support for OpenNebula and Eucalyptus.
–VMs are reused until there are no more jobs requiring that VM type.
–Images are pulled via HTTP or must pre-exist at the site.
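
To make the condor_q polling path concrete, here is a minimal sketch, assuming a helper roughly along these lines exists in Cloud Scheduler; the function name and attribute list are illustrative only, not the actual code. It parses condor_q -long output and collects the custom VM attributes of idle jobs, the same attributes shown in the job description file on the next slide.

# Hypothetical sketch of the condor_q polling path described above; this is
# not the actual Cloud Scheduler code, just an illustration of the idea.
import subprocess

VM_ATTRIBUTES = ["VMType", "VMLoc", "VMCPUArch", "VMStorage",
                 "VMCPUCores", "VMMem", "VMAMI", "VMInstanceType"]

def get_idle_job_requirements():
    # condor_q -long prints every attribute of every job as "Name = Value"
    # lines, with a blank line between jobs.
    output = subprocess.check_output(["condor_q", "-long"]).decode()
    jobs, current = [], {}
    for line in output.splitlines():
        if not line.strip():
            if current:
                jobs.append(current)
                current = {}
            continue
        if " = " in line:
            name, value = line.split(" = ", 1)
            current[name.strip()] = value.strip().strip('"')
    if current:
        jobs.append(current)
    # JobStatus == 1 means the job is idle (waiting for a resource).
    idle = [job for job in jobs if job.get("JobStatus") == "1"]
    return [{a: job.get(a) for a in VM_ATTRIBUTES} for job in idle]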

Condor Job Description File

Universe = vanilla
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
Requirements = VMType =?= "BaBarAnalysis-52"
+VMLoc = "
+VMCPUArch = "x86_64"
+VMStorage = "1"
+VMCPUCores = "1"
+VMMem = "2555"
+VMAMI = "ami-64ea1a0d"
+VMInstanceType = "m1.small"
+VMJobPerCore = True
getenv = True
Queue

The +VM* entries are custom Condor attributes. Cloud Scheduler uses these custom attributes to compose a call to the IaaS service. EC2 REST interfaces are supported in addition to Nimbus IaaS with X.509.
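
To illustrate how such custom attributes could be turned into an IaaS request in the EC2 case, here is a minimal sketch using the boto EC2 API; the region, the helper names, and the contextualization contents are assumptions for illustration, not the actual Cloud Scheduler code.

# Illustrative sketch only: composing an EC2 request from a job's custom
# +VM* attributes, using the boto 2 API. Not the real Cloud Scheduler code.
import boto.ec2

def make_context(job_ad):
    # Hypothetical helper: build the user-data blob that points the VM's
    # condor_startd at the central manager (details omitted here).
    return "CONDOR_HOST = condor.example.org"

def boot_vm_for_job(job_ad, region="us-east-1"):
    # Boot one EC2 instance matching the job's +VM* attributes.
    conn = boto.ec2.connect_to_region(region)   # credentials from environment
    reservation = conn.run_instances(
        image_id=job_ad["VMAMI"],                # e.g. "ami-64ea1a0d"
        instance_type=job_ad["VMInstanceType"],  # e.g. "m1.small"
        min_count=1,
        max_count=1,
        user_data=make_context(job_ad),
    )
    return reservation.instances[0]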

Proof of Concept

Initial Result
The basic proof of concept works with manual submission of pilot jobs to a single cloud.
Running on the Hermes cloud at the University of Victoria (~300 cores, in production operation since June 2010; see extra slides for details).
Data transfer via dccp to the VM is working.
Log output was transferred to the SE and back to the PanDA server.
Minor problems running the pilot inside CernVM (some outdated configs, and learning for us).

Possible Improvements
Add a pilot factory
Add more clouds

Possible Simplification
Try using the development version of Condor (7.7+) with support for the EC2 grid resource type.
UVic has tested this with Amazon proper but not with other cloud types.
What to do with credentials? How to manage access to multiple clouds?

Summary
The basic proof of concept works, tested with a small number of jobs.
–Work to add a pilot factory.
–Already known to scale to hundreds of VMs; will it scale beyond that?
The setup is a little complex; it could possibly be simplified with some thought about how to deal with credentials.

Beginning of Extra Slides

About the code
Relatively small Python package, lots of cloud interaction examples.

Ian-Gables-MacBook-Pro:cloud-scheduler igable$ cat source_files | xargs wc -l
       0 ./cloudscheduler/__init__.py
       1 ./cloudscheduler/__version__.py
     998 ./cloudscheduler/cloud_management.py
    1169 ./cloudscheduler/cluster_tools.py
     362 ./cloudscheduler/config.py
     277 ./cloudscheduler/info_server.py
    1086 ./cloudscheduler/job_management.py
       0 ./cloudscheduler/monitoring/__init__.py
      63 ./cloudscheduler/monitoring/cloud_logger.py
     208 ./cloudscheduler/monitoring/get_clouds.py
     176 ./cloudscheduler/utilities.py
      13 ./scripts/ec2contexthelper/setup.py
      28 ./setup.py
      99 cloud_resources.conf
    1046 cloud_scheduler
     324 cloud_scheduler.conf
     130 cloud_status
    5980 total

HEP Legacy Data Project
We have been funded in Canada to investigate a possible solution for analyzing BaBar data for the next 5-10 years.
Collaborating with SLAC, who are also pursuing this goal.
We are exploiting VMs and IaaS clouds.
Assume we are going to be able to run BaBar code in a VM for the next 5-10 years.
We hope that the results will be applicable to other experiments.
2.5 FTEs for 2 years, ending in October.

Experience with CANFAR
[Plot annotations: Shorter Jobs; Cluster Outage]

VM Run Times (CANFAR)
[Plot annotations: Max allowed VM run time (1 week); n = VMs booted since July 2010]

VM Boots (CANFAR)

Job Submission
We use a Condor Rank expression to ensure that jobs only end up on the VMs they are intended for.
Users use Condor attributes to specify the number of CPUs, memory, and scratch space that should be on their VMs (this only works with IaaS that supports it).
Users can also specify an EC2 Amazon Machine Image (AMI) ID if Cloud Scheduler is configured for access to this type of cloud.
Users are prevented from running on VMs which were not booted for them (the Condor START requirement is checked against the user's DN).
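
As an illustration of that START/DN restriction, the sketch below builds a condor_config fragment that could be injected into a VM at contextualization time. The exact expressions used by Cloud Scheduler are not shown in these slides, so the attribute choices (x509userproxysubject, STARTD_ATTRS), the function itself, and the example DN are assumptions for illustration only.

# Hypothetical illustration of the DN-based START restriction described above;
# the real Cloud Scheduler configuration may differ.
def startd_config_for_user(user_dn, vm_type):
    # Return a condor_config fragment so that only jobs submitted with a proxy
    # matching user_dn, and requesting this VM type, may start on the VM.
    return "\n".join([
        'VMType = "%s"' % vm_type,
        "STARTD_ATTRS = $(STARTD_ATTRS) VMType",
        # x509userproxysubject is the DN of the proxy the job was submitted with.
        'START = (TARGET.x509userproxysubject =?= "%s")' % user_dn,
    ])

print(startd_config_for_user(
    "/C=CA/O=Grid/OU=phys.uvic.ca/CN=Jane User", "BaBarAnalysis-52"))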

BaBar Cloud Configuration

Other Examples

Inside a Cloud VM

BaBar MC production
[Plot annotation: More resources added]

Motivation
Projects requiring modest resources that we believe to be suitable for Infrastructure-as-a-Service (IaaS) clouds:
–The High Energy Physics Legacy Data project (2.5 FTEs). Target application: BaBar MC and user analysis.
–The Canadian Advanced Network for Astronomical Research (CANFAR). Supporting observational astronomy data analysis. Jobs are embarrassingly parallel like HEP (one image is like one event).
We expect an increasing number of IaaS clouds to be available for research computing.

Step 1
Research and commercial clouds are made available with some cloud-like interface.

Step 2
The user submits to a Condor job scheduler that has no resources attached to it.

Credentials Movement
The user does a myproxy-logon.
The user submits a Condor job with the proxy.
An IaaS API request is made to Nimbus with the proxy.
The cluster service certificate is injected into the VM (could be replaced with a user cert).
The VM boots and joins the Condor pool using GSI authentication with the service certificate.
Condor submits the job to the VM with the delegated user proxy.

Step 3
Cloud Scheduler detects that there are waiting jobs in the Condor queues and then makes a request to boot VMs that match the job requirements.

Step 4
The VMs boot, attach themselves to the Condor queues, and begin draining jobs. Once no more jobs require the VMs, Cloud Scheduler shuts them down.
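
Pulling Steps 3 and 4 together, here is a simplified sketch of the boot-and-retire loop implied by these slides; it is not the real Cloud Scheduler implementation, and the four callables are placeholders standing in for queue polling and IaaS operations.

# Simplified, hypothetical sketch of the Step 3 / Step 4 loop described above.
# The callables passed in are placeholders, not a real Cloud Scheduler API.
import time
from collections import Counter

def scheduling_loop(get_idle_jobs, list_vms, boot_vm, shutdown_vm,
                    poll_interval=60):
    while True:
        # Count how many idle jobs want each VM type (input to Step 3).
        demand = Counter(job["VMType"] for job in get_idle_jobs())
        running = Counter(vm_type for vm_type, vm_id in list_vms())

        # Step 3: boot VMs for job types with unmet demand.
        for vm_type, wanted in demand.items():
            for _ in range(max(0, wanted - running[vm_type])):
                boot_vm(vm_type)

        # Step 4: retire VMs whose type no longer appears in the queue.
        for vm_type, vm_id in list_vms():
            if demand[vm_type] == 0:
                shutdown_vm(vm_id)

        time.sleep(poll_interval)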

Credentials Step 1

Credentials Step 2

Credentials Step 3

Credentials Step 4

Credentials Step 5