The Glidein Service
Gideon Juve

What are glideins?

A technique for creating temporary, user-controlled Condor pools using resources from remote grid sites:

1. Grid jobs (called "glideins") are submitted to a grid site using normal mechanisms (Globus, etc.)
2. Glideins start Condor worker daemons on the remote resources
3. Glidein workers join the user's Condor pool and are used to run application jobs
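As a minimal sketch of step 1, a glidein job could be submitted with Condor-G as a grid-universe job. The startup script name and arguments below are assumptions, and the site contact string is borrowed from the command-line example at the end of these slides:

# glidein.submit -- hypothetical Condor-G description for one glidein job
universe      = grid
grid_resource = gt2 grid-hg.ncsa.teragrid.org/jobmanager-pbs
executable    = glidein_start.sh              # assumed wrapper that launches condor_master
arguments     = --condor-host juve.isi.edu    # the user's central manager
output        = glidein.out
error         = glidein.err
log           = glidein.log
queue

Running condor_submit glidein.submit would then hand the job to Globus at the site.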

How glideins work

[Diagram: LOCAL SITE with the user and a Condor central manager; GRID SITE with a head node (Globus, LRM) and Condor worker nodes]

To use glideins, the user runs a Condor central manager on a local machine that they control. This Condor pool will manage glidein resources allocated from a remote grid site.

How glideins work

[Diagram: as above, with the step "Submit Glidein Job" from the user to the grid site]

The user submits a job to the grid site. This job (called a "glidein") will start Condor on the remote worker nodes. It is assumed that the Condor worker executables are pre-installed at the site.

How glideins work

[Diagram: as above, with the step "Start Glidein"; master and startd daemons appear on the worker nodes]

The glidein job configures and starts the Condor worker daemons on the grid site's worker nodes.
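A minimal sketch of what such a startup script might do, assuming Condor is pre-installed at the path used in the command-line example at the end of these slides (the script itself and its details are hypothetical):

#!/bin/sh
# glidein_start.sh -- hypothetical sketch of a glidein startup script
CONDOR_DIR=/gpfs_scratch1/juve/glidein    # pre-installed Condor release directory
LOCAL_DIR=`mktemp -d`                     # per-node scratch for logs, spool, execute

# Minimal config pointing the daemons back at the user's central manager
cat > $LOCAL_DIR/condor_config <<EOF
CONDOR_HOST = juve.isi.edu
RELEASE_DIR = $CONDOR_DIR
LOCAL_DIR   = $LOCAL_DIR
DAEMON_LIST = MASTER, STARTD
EOF

CONDOR_CONFIG=$LOCAL_DIR/condor_config
export CONDOR_CONFIG
exec $CONDOR_DIR/sbin/condor_master -f    # run in the foreground so the LRM tracks it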

How glideins work

[Diagram: as above, with the step "Contact Central Manager" from the worker-node daemons back to the user's central manager]

The Condor daemons contact the user's central manager and become part of the user's Condor pool.
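For the join to succeed, the user's central manager must authorize the remote daemons to advertise into the pool. A minimal sketch in the configuration syntax of that era (the host pattern is an assumption):

# condor_config on the user's central manager (sketch)
HOSTALLOW_WRITE = $(HOSTALLOW_WRITE), *.ncsa.teragrid.org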

How glideins work

[Diagram: as above, with the step "Submit Application Job" from the user to the central manager]

The user submits application jobs to their Condor pool. The jobs are matched with glidein resources.
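Application jobs are ordinary Condor jobs; a minimal vanilla-universe sketch (the executable and file names are hypothetical):

# app.submit -- hypothetical application job matched against glidein slots
universe   = vanilla
executable = my_app
arguments  = input.dat
output     = app.out
error      = app.err
log        = app.log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue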

How glideins work

[Diagram: as above, with the step "Run Application Job"; a job runs under the startd on a worker node]

The application jobs are dispatched to the worker nodes for execution. Multiple application jobs can be executed on a single glidein resource.

What are glideins good for?

- Running short jobs on the grid
  - Condor can dispatch jobs faster than Globus
- Bypassing site scheduling policies
  - Max submitted/running jobs
  - Priority for large jobs
- Avoiding competition for resources
  - Glideins reserve resources for multiple jobs
- Reducing load on the head node/LRM
  - Fewer Globus jobmanagers polling for status

Other Approaches

- Advance reservations
  - Ask the scheduler for exclusive access to resources
  - Not supported at many sites
  - Typically managed by site administrators (not users)
  - Users are charged a premium for resources
  - Unused reservations cannot be returned
- Task clustering
  - Group multiple, independent jobs together
  - Can delay the release of some jobs
  - May reduce parallelism
- Not mutually exclusive
  - Can use a combination of techniques

Glidein Service

A GT4 grid service for running glideins:
- Automates the installation and configuration of Condor at the grid site
- Simplifies the complex setup and configuration required to run glideins

Separate setup and provisioning steps:
1. Create "sites" for remote installation and setup of Condor executables
2. Create "glideins" for resource allocation

How the Glidein Service works

[Diagram: as before, but the LOCAL SITE also runs a Globus container hosting the Glidein Service]

The setup is similar to regular glideins. In addition to the Condor central manager, the user runs a Globus container that hosts the Glidein Service.

How the Glidein Service works

[Diagram: as above, with the step "Create Site" from the user to the Glidein Service]

The user creates a new site by sending a request to the Glidein Service. The request specifies details about the grid site, including job submission information, file system paths, etc.

How the Glidein Service works

[Diagram: as above, with the step "Install & Configure Condor" from the Glidein Service to the grid site, leaving a Condor installation ("C") on the shared file system]

The Glidein Service installs Condor executables on a shared file system at the grid site. The appropriate executables are automatically selected and downloaded from a central repository based on architecture and operating system.

How the Glidein Service works

[Diagram: as above, with the step "Create Glidein" from the user to the Glidein Service]

The user allocates resources by submitting glidein requests to the Glidein Service. The request specifies the number of hosts/CPUs to acquire and the duration of the reservation.

How the Glidein Service works

[Diagram: as above, with the step "Submit Glidein Job" from the Glidein Service to the grid site]

The Glidein Service translates the user's request into a glidein job, which is submitted to the grid site.

How the Glidein Service works

[Diagram: as above, with the step "Start Glidein"; master and startd daemons appear on the worker nodes]

The glidein job starts Condor daemons on the worker nodes of the grid site.

How the Glidein Service works

[Diagram: as above, with the step "Contact Central Manager" from the worker-node daemons back to the user's central manager]

The Condor daemons contact the user's Condor central manager and become part of the user's resource pool.

How the Glidein Service works

[Diagram: as above, with the step "Submit Application Job" from the user to the central manager]

The user submits application jobs to their Condor pool. The jobs are matched with available glidein resources.

How the Glidein Service works

[Diagram: as above, with the step "Run Application Job"; jobs run on the worker nodes]

The application jobs are dispatched to the remote workers for execution.

How the Glidein Service works

[Diagram: as above, with the step "Cancel Glidein" from the user to the Glidein Service]

When the user is finished with their application they cancel their glidein request (or let it expire automatically).

How the Glidein Service works

[Diagram: as above, with the steps "Cancel Glidein Job" from the Glidein Service to the grid site and "Un-register" from the worker nodes to the central manager]

When the glidein is cancelled, the Glidein Service cancels the glidein job, which causes the Condor daemons on the worker nodes to un-register themselves from the user's Condor pool and exit.

How the Glidein Service works

[Diagram: as above, with the step "Remove Site" from the user to the Glidein Service]

The user can submit multiple glidein requests for a single site. When the user is done with the site they ask the Glidein Service to remove the site.

How the Glidein Service works

[Diagram: as above, with the step "Uninstall Condor" from the Glidein Service to the grid site]

Finally, the Glidein Service removes Condor from the shared file system at the grid site. This removes all executables, config files, and logs.

Features

- Auto-configuration (see the sketch below)
  - Detect architecture, OS, glibc -> Condor package
  - Determine public IP
  - Generate Condor config file
- Multiple interfaces
  - Command-line
  - SOAP
  - Java API
- Automatic resubmission of glideins
  - Indefinitely, N times, or until a given date/time
- Notifications
- History tracking
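As an illustration of the package-selection part of auto-configuration, the mapping might reduce to something like the following shell sketch (the package naming scheme is hypothetical):

# Hypothetical sketch: derive a Condor package name from the node's platform
ARCH=`uname -m`                                       # e.g. ia64, x86_64
OS=`uname -s`                                         # e.g. Linux
GLIBC=`ldd --version | head -1 | awk '{print $NF}'`   # e.g. 2.3
echo "condor-$ARCH-$OS-glibc$GLIBC.tar.gz"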

Challenges

- Firewalls / private IPs
  - Hinder communication between glideins and the user's pool
  - Potential solutions:
    - GCB (a config sketch follows this list)
    - VPN?
    - Run the central manager on the head node (restrictions)
- Shared file system
  - Currently required for the Glidein Service to stage Condor executables
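For the GCB option, a minimal sketch of the glidein-side configuration from that era (the broker address is hypothetical, and the exact knobs varied across Condor versions):

# Hypothetical sketch: relay glidein daemon traffic through a GCB broker
NET_REMAP_ENABLE  = TRUE
NET_REMAP_SERVICE = GCB
NET_REMAP_INAGENT = 192.0.2.1    # hypothetical GCB broker address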

Command Line Example

$ glidein create-site --site-name mercury --condor-version \
    --environment GLOBUS_LOCATION=/usr/local/globus r3 \
    --install-path /home/ac/juve/glidein \
    --local-path /gpfs_scratch1/juve/glidein \
    --staging-service 'gt2 grid-hg.ncsa.teragrid.org/jobmanager-fork' \
    --glidein-service 'gt2 grid-hg.ncsa.teragrid.org/jobmanager-pbs'

$ glidein list-site
ID NAME    CREATED LAST UPDATE STATE MESSAGE
3  mercury :       :17         READY Installed

$ glidein create-glidein --site 3 --count 1 --host-count 1 --num-cpus 1 \
    --wall-time 30 --condor-host juve.isi.edu

$ glidein list-glidein
ID SITE        CONDOR HOST  SLOTS WTIME CREATED LAST UPDATE STATE   MESSAGE
1  mercury (3) juve.isi.edu             :       :24         RUNNING Running

$ condor_status
Name               OpSys Arch State     Activity LoadAv Mem ActvtyTime
tg-c408.ncsa.terag LINUX IA64 Unclaimed Idle                :00:05

            Total Owner Claimed Unclaimed Matched Preempting Backfill
 IA64/LINUX
      Total