Alain Roy Computer Sciences Department University of Wisconsin-Madison 25-June-2002 Using Condor on the Grid

Good evening! › Thank you for having me! › I am:  Alain Roy  Computer Science Ph.D. in Quality of Service, with the Globus Project  Working with the Condor Project › This is the last of three Condor tutorials

Review: What is Condor? › Condor converts collections of distributively owned workstations and dedicated clusters into a distributed high-throughput computing facility.  Run lots of jobs over a long period of time  Not a short burst of “high-performance” computing › Condor manages both machines and jobs with ClassAd Matchmaking to keep everyone happy

Condor Takes Care of You › Condor does whatever it takes to run your jobs, even if some machines…  Crash (or are disconnected)  Run out of disk space  Don’t have your software installed  Are frequently needed by others  Are far away & managed by someone else

What is Unique about Condor? › ClassAds › Transparent checkpoint/restart › Remote system calls › Works in heterogeneous clusters › Clusters can be:  Dedicated  Opportunistic

What’s Condor Good For? › Managing a large number of jobs › Robustness  Checkpointing  Persistent Job Queue › Ability to access more resources › Flexible policies to control usage on your pool
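As a reminder of how this looks in practice, a local high-throughput run like Frieda's is described by a submit description file; a minimal sketch, where the executable name, file names, and the count of 600 are placeholders:

 universe   = vanilla
 executable = sim
 output     = sim.$(Process).out
 error      = sim.$(Process).err
 log        = sim.log
 queue 600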

A Bit of Condor Philosophy › Condor brings more computing to everyone  A small-time scientist can make an opportunistic pool with 10 machines, and get 10 times as much computing done.  A large collaboration can use Condor to control its dedicated pool with hundreds of machines.

Condor’s Idea: Computing power is everywhere; we try to make it usable by anyone.

Condor and the Grid › The Grid provides:  Uniform, dependable, consistent, pervasive, and inexpensive computing. Hopefully. › Condor wants to make computing power usable by everyone

This Must Be a Match Made in Heaven!

Remember Frieda? Today we’ll revisit Frieda’s Condor/Grid explorations in more depth

First, A Review of Globus › Globus isn’t “The Grid”, but it provides a lot of commonly used technologies for building Grids. › Globus is a toolkit: pick the pieces you wish to use › Globus implements standard Grid protocols and APIs

Globus Toolkit Pieces › Security: Grid Security Infrastructure › Resource Management: GRAM  Submit and monitor jobs › Information services › Data Transfer: GridFTP

Grid Security Infrastructure › Authentication and authorization › Certificate authorities › Single sign-on › Usually public-key authentication › Can work with Kerberos
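For example, a user typically starts a session by creating a short-lived proxy from their long-lived certificate; a minimal sketch using the standard Globus Toolkit commands, with no site-specific options shown:

 # Create a limited-lifetime proxy credential from your X.509 certificate
 grid-proxy-init
 # Check the proxy's subject and remaining lifetime
 grid-proxy-info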

Resource Management › Single method for submitting jobs › Multiple backends for running jobs  Fork  Condor  PBS/LSF/…
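A hedged sketch of a GRAM submission using globusrun and RSL; the gatekeeper hostname and jobmanager names are placeholders, and the same request works whichever backend sits behind the gatekeeper:

 # Run a job through the default (fork) jobmanager on a remote gatekeeper
 globusrun -o -r gatekeeper.example.edu/jobmanager '&(executable=/bin/hostname)'
 # The same request, handed to the Condor pool behind that gatekeeper
 globusrun -o -r gatekeeper.example.edu/jobmanager-condor '&(executable=/bin/hostname)'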

Information Services › LDAP-based  Easy to access with standard clients › Implements standard schemas for representing resources

Data Transfer › GridFTP  Uses GSI authentication  High-performance through parallel and striped transfers  Quickly becoming widely used
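A hedged example of a GridFTP transfer with globus-url-copy; the hostnames and paths are placeholders:

 # Pull a remote file to local disk over GridFTP, authenticated with GSI
 globus-url-copy gsiftp://storage.example.edu/data/run01.dat file:///tmp/run01.dat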

Where does Condor Fit In? › Condor back-end for GRAM  Submit Globus jobs  They run in your Condor pool › Condor-G submits jobs to Globus resources  Provides reliability and monitoring beyond standard Globus mechanisms › They can be used together! › We’ll describe both of these.

Condor back-end for GRAM › GRAM uses a job manager to control jobs  Globus comes with a Condor job manager  Easy to configure with setup-globus-gram-jobmanager › Users can configure Condor behavior with RSL when submitting jobs:  jobtype: configures the universe (vanilla/standard)  The job manager constructs a Condor submit file and submits it to the Condor pool
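A hedged sketch of what such a submission might look like from the Globus side; the gatekeeper name is a placeholder, and the exact mapping from jobtype values to Condor universes depends on how the jobmanager is configured:

 # Send a job to a gatekeeper whose jobmanager forwards it to the local Condor pool
 globusrun -o -r gatekeeper.example.edu/jobmanager-condor \
     '&(executable=/home/frieda/progname)(jobtype=condor)'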

I have 600 simulations to run. Where can I get help?

Frieda… › Installed personal Condor › Made a larger Condor pool › Added dedicated nodes › Added Grid resources › We talked about the first three steps in detail earlier.

Frieda Goes to the Grid! › First Frieda takes advantage of her Condor friends! › She knows people with their own Condor pools, and gets permission to access their resources › She then configures her Condor pool to “flock” to these pools

(Diagram: 600 Condor jobs submitted from the personal Condor on your workstation run in your own Condor Pool and flock to the Friendly Condor Pool.)

How Flocking Works › Add a line to your condor_config:
 FLOCK_TO = Friendly-Pool
 FLOCK_FROM = Friedas-Pool
(Diagram: the Schedd on the Submit Machine talks to the Collector and Negotiator on its own Central Manager (CONDOR_HOST), and then to the Collector and Negotiator on the Friendly-Pool Central Manager.)
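A hedged sketch of the two sides of that configuration, with made-up hostnames and assuming host authorization between the pools is already in place:

 ## In condor_config on Frieda's pool (the submit side)
 FLOCK_TO = cm.friendly-pool.edu

 ## In condor_config on the Friendly pool's central manager
 FLOCK_FROM = submit.friedas-pool.edu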

Condor Flocking › Remote pools are contacted in the order specified until jobs are satisfied › The list of remote pools is a property of the Schedd, not the Central Manager  Different users can Flock to different pools  Remote pools can allow specific users › User-priority system is “flocking-aware”  A pool’s local users can have priority over remote users “flocking” in.

Condor Flocking, cont. › Flocking is Condor-specific technology… › Frieda also has access to Globus resources she wants to use  She has certificates and access to Globus gatekeepers at remote institutions › But Frieda wants Condor’s queue-management features for her Globus jobs! › She installs Condor-G so she can submit “Globus Universe” jobs to Condor

Condor-G Installation: Tell it what you need…

… and watch it go!

Frieda Submits a Globus Universe Job › In her submit description file, she specifies:  Universe = Globus  Which Globus Gatekeeper to use  Optional: Location of the file containing her Globus certificate
 universe = globus
 globusscheduler = beak.cs.wisc.edu/jobmanager
 executable = progname
 queue
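A slightly fuller sketch of the same submit file; the proxy path, output/log names, and the count of 600 are assumptions added here, with x509userproxy pointing Condor at the Globus proxy file:

 universe        = globus
 globusscheduler = beak.cs.wisc.edu/jobmanager
 executable      = progname
 x509userproxy   = /tmp/x509up_u1001
 output          = progname.$(Process).out
 error           = progname.$(Process).err
 log             = progname.log
 queue 600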

How It Works (diagram sequence: Frieda's Personal Condor on one side, a Globus Resource running LSF on the other) › Frieda submits 600 Globus-universe jobs to her Schedd › The Schedd starts a GridManager to manage them › The GridManager contacts the Globus Resource, which starts a JobManager › The JobManager submits the work to LSF, which runs the User Job

Condor Globus Universe

Globus Universe Concerns › What about Fault Tolerance?  Local Crashes: What if the submit machine goes down?  Network Outages: What if the connection to the remote Globus jobmanager is lost?  Remote Crashes: What if the remote Globus jobmanager crashes? What if the remote machine goes down?

New Fault Tolerance › Ability to restart a JobManager › Enhanced two-phase commit submit protocol › Donated by Condor project to Globus 2.0

Globus Universe Fault-Tolerance: Submit-side Failures › All relevant state for each submitted job is stored persistently in the Condor job queue. › This persistent information allows the Condor GridManager, upon restart, to read the state and reconnect to the JobManagers that were running at the time of the crash. › If a JobManager fails to respond…

Globus Universe Fault-Tolerance: Lost Contact with Remote Jobmanager › Can we contact the gatekeeper?  No: the remote machine crashed or the network is down; retry until we can talk to the gatekeeper again  Yes: the network was down, or the jobmanager crashed; try to reconnect › Can we reconnect to the jobmanager?  Yes: the network was down; carry on  No: the jobmanager crashed, or the job completed › Has the job completed?  Yes: update the queue  No: the job is still running; restart the jobmanager

Globus Universe Fault-Tolerance: Credential Management › Authentication in Globus is done with limited-lifetime X.509 proxies › A proxy may expire before jobs finish executing › Condor can put jobs on hold and ask the user to refresh the proxy › Todo: Interface with MyProxy…
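A hedged sketch of the manual refresh cycle this implies; the cluster number 42 is a placeholder:

 # Create a fresh proxy before (or after) the old one expires
 grid-proxy-init
 # Release the jobs Condor put on hold when the proxy ran low
 condor_release 42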

But Frieda Wants More… › She wants to run standard universe jobs on Globus-managed resources that aren’t running Condor  For matchmaking and dynamic scheduling of jobs  For job checkpointing and migration  For remote system calls

Solution: Condor GlideIn › Frieda can use the Globus Universe to run Condor daemons on Globus resources › When the resources run these GlideIn jobs, they will temporarily join her Condor Pool › She can then submit Standard, Vanilla, PVM, or MPI Universe jobs and they will be matched and run on the Globus resources
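Conceptually, a GlideIn is a Globus Universe job whose payload is the Condor daemons themselves; a hedged, simplified sketch (the gatekeeper, wrapper script, and count are hypothetical, and the real GlideIn tooling automates this):

 # Each of these jobs starts Condor daemons on a remote node; while they run,
 # that node reports to Frieda's Collector and temporarily joins her pool.
 universe        = globus
 globusscheduler = gatekeeper.example.edu/jobmanager-lsf
 executable      = glidein_startup.sh   # hypothetical wrapper that launches condor_master/condor_startd
 queue 10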

How It Works (diagram sequence: Frieda's Personal Condor, with its Schedd and Collector, on one side; a Globus Resource running LSF on the other) › Frieda's 600 Condor jobs wait in her Schedd's queue, and she submits GlideIn jobs alongside them › The Schedd starts a GridManager for the GlideIn jobs › The GridManager contacts the Globus Resource, which starts a JobManager › The JobManager hands the GlideIn job to LSF, which launches a Condor Startd on the resource › The Startd reports to Frieda's Collector and joins her pool › The Schedd then matches and runs the User Job on that Startd

GlideIn Concerns › What if a Globus resource kills my GlideIn job?  That resource will disappear from your pool and your jobs will be rescheduled on other machines  Standard universe jobs will resume from their last checkpoint like usual › What if all my jobs are completed before a GlideIn job runs?  If a GlideIn Condor daemon is not matched with a job in 10 minutes, it terminates, freeing the resource

What Have We Done on the Grid Already? › NUG30 › USCMS Testbed

NUG30 › Quadratic assignment problem › 30 facilities, 30 locations  Minimize the cost of transferring materials between them › Posed as a challenge in 1968, long unsolved › But with a good pruning algorithm & high-throughput computing...

NUG30 Solved on the Grid with Condor + Globus Resources simultaneously utilized: › the Origin 2000 (through LSF) at NCSA › the Chiba City Linux cluster at Argonne › the SGI Origin 2000 at Argonne › the main Condor pool at Wisconsin (600 processors) › the Condor pool at Georgia Tech (190 Linux boxes) › the Condor pool at UNM (40 processors) › the Condor pool at Columbia (16 processors) › the Condor pool at Northwestern (12 processors) › the Condor pool at NCSA (65 processors) › the Condor pool at INFN (200 processors)

NUG30—Number of Workers

NUG30 - Solved!!! Sender: Subject: Re: Let the festivities begin. Hi dear Condor Team, you all have been amazing. NUG30 required 10.9 years of Condor Time. In just seven days ! More stats tomorrow !!! We are off celebrating ! condor rules ! cheers, JP.

USCMS Testbed › Production of CMS data › Testbed has five sites across the US › Condor, Condor-G, Globus, GDMP… › A fantastic test environment for the Grid: the buck stops here!  It exposed errors between systems, logging problems, inetd confusion, and exercised the Globus GASS cache

Questions? Comments? › Web: