Using Condor on the Grid › Alain Roy › Computer Sciences Department, University of Wisconsin-Madison › 25-June-2002
Good evening! › Thank you for having me! › I am: Alain Roy, Computer Science Ph.D. in Quality of Service with the Globus Project; now working with the Condor Project › This is the last of three Condor tutorials
Review: What is Condor? › Condor converts collections of distributively owned workstations and dedicated clusters into a distributed high-throughput computing facility: run lots of jobs over a long period of time, not a short burst of “high-performance” computing › Condor manages both machines and jobs with ClassAd Matchmaking to keep everyone happy
Condor Takes Care of You › Condor does whatever it takes to run your jobs, even if some machines… Crash (or are disconnected) Run out of disk space Don’t have your software installed Are frequently needed by others Are far away & managed by someone else
What is Unique about Condor? › ClassAds › Transparent checkpoint/restart › Remote system calls › Works in heterogeneous clusters › Clusters can be: Dedicated Opportunistic
What’s Condor Good For? › Managing a large number of jobs › Robustness Checkpointing Persistent Job Queue › Ability to access more resources › Flexible policies to control usage on your pool
A Bit of Condor Philosophy › Condor brings more computing to everyone A small-time scientist can make an opportunistic pool with 10 machines, and get 10 times as much computing done. A large collaboration can use Condor to control its dedicated pool with hundreds of machines.
Condor’s Idea › Computing power is everywhere; we try to make it usable by anyone.
Condor and the Grid › The Grid provides: Uniform, dependable, consistent, pervasive, and inexpensive computing. Hopefully. › Condor wants to make computing power usable by everyone
This Must Be a Match Made in Heaven! › Condor + Globus
Remember Frieda? Today we’ll revisit Frieda’s Condor/Grid explorations in more depth
First, A Review of Globus › Globus isn’t “The Grid”, but it provides a lot of commonly used technologies for building Grids. › Globus is a toolkit: pick the pieces you wish to use › Globus implements standard Grid protocols and APIs
Globus Toolkit Pieces › Security: Grid Security Infrastructure › Resource Management: GRAM Submit and monitor jobs › Information services › Data Transfer: GridFTP
Grid Security Infrastructure › Authentication and authorization › Certificate authorities › Single sign-on › Usually public-key authentication › Can work with Kerberos
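For example, single sign-on in practice looks roughly like this (a minimal sketch; the proxy lifetime shown is just an illustrative choice):
# Create a short-lived proxy certificate from your long-term credential;
# you type your pass phrase once, then subsequent Grid operations use the proxy
grid-proxy-init -hours 12
# Check the proxy's subject and remaining lifetime
grid-proxy-info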
Resource Management › Single method for submitting jobs › Multiple backends for running jobs Fork Condor PBS/LSF/…
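As a sketch, the same client command can target different back-ends just by changing the jobmanager contact (the host name and jobmanager service names here are illustrative; sites configure their own):
# Run a job through the default (fork) jobmanager
globus-job-run beak.cs.wisc.edu /bin/hostname
# Run the same job through a Condor or PBS back-end instead
globus-job-run beak.cs.wisc.edu/jobmanager-condor /bin/hostname
globus-job-run beak.cs.wisc.edu/jobmanager-pbs /bin/hostname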
Information Services › LDAP-based Easy to access with standard clients › Implements standard schemas for representing resources
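Because it is plain LDAP, any stock client can query it; a minimal sketch (the host name is hypothetical, and the port and base DN shown are the usual MDS defaults of this era):
# Query a resource's information server with a standard LDAP client
ldapsearch -x -h grid.example.edu -p 2135 \
    -b "mds-vo-name=local, o=grid" "(objectclass=*)"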
Data Transfer › GridFTP Uses GSI authentication High-performance through parallel and striped transfers Quickly becoming widely used
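A sketch of a transfer using parallel streams (the host and file paths are hypothetical):
# Pull a file from a GridFTP server using 4 parallel TCP streams,
# authenticating with the GSI proxy created by grid-proxy-init
globus-url-copy -p 4 gsiftp://grid.example.edu/data/input.dat file:///tmp/input.dat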
Where does Condor Fit In? › Condor back-end for GRAM: submit Globus jobs and they run in your Condor pool › Condor-G: submit to Globus resources; provides reliability and monitoring beyond standard Globus mechanisms › Can be used together! › We’ll describe both of these.
Condor back-end for GRAM › GRAM uses a jobmanager to control jobs; Globus comes with a Condor jobmanager › Easy to configure with setup-globus-gram-jobmanager › Users can configure Condor behavior with RSL when submitting jobs: e.g., jobtype selects the universe (vanilla/standard) › The jobmanager constructs a Condor submit file and submits it to the Condor pool (see the sketch below)
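For instance, a submission through the Condor jobmanager might look like the following sketch (the gatekeeper contact reuses the host from a later slide, and the jobtype value is illustrative, since the exact mapping to universes depends on the jobmanager configuration):
# Submit through the Condor back-end; the jobmanager builds a Condor
# submit file from this RSL and hands it to the local pool
globusrun -o -r beak.cs.wisc.edu/jobmanager-condor \
    '&(executable=/bin/hostname)(jobtype=single)'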
I have 600 simulations to run. Where can I get help?
Frieda… › Installed personal Condor › Made a larger Condor pool › Added dedicated nodes › Added Grid resources › We talked about the first three steps in detail earlier.
Frieda Goes to the Grid! › First Frieda takes advantage of her Condor friends! › She knows people with their own Condor pools, and gets permission to access their resources › She then configures her Condor pool to “flock” to these pools
(Diagram: Frieda’s personal Condor on her workstation, holding 600 Condor jobs, flocks from her own Condor pool to a friendly Condor pool.)
How Flocking Works › Add a line to the condor_config on each side: FLOCK_TO = Friendly-Pool on Frieda’s submit machine, and FLOCK_FROM = Friedas-Pool on the Friendly-Pool side › (Diagram: the Schedd on the submit machine talks to the Collector and Negotiator of its own Central Manager (CONDOR_HOST) and of the Friendly-Pool Central Manager.)
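A minimal sketch of the two halves of that configuration, assuming hypothetical host names (the friendly pool may also need to authorize Frieda’s submit machine in its HOSTALLOW settings):
# On Frieda's submit machine (condor_config):
FLOCK_TO = condor.friendly.edu
# On the Friendly-Pool central manager (condor_config):
FLOCK_FROM = frieda.cs.wisc.edu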
Condor Flocking › Remote pools are contacted in the order specified until jobs are satisfied › The list of remote pools is a property of the Schedd, not the Central Manager Different users can Flock to different pools Remote pools can allow specific users › User-priority system is “flocking-aware” A pool’s local users can have priority over remote users “flocking” in.
Condor Flocking, cont. › Flocking is Condor-specific technology… › Frieda also has access to Globus resources she wants to use: she has certificates and access to Globus gatekeepers at remote institutions › But Frieda wants Condor’s queue management features for her Globus jobs! › She installs Condor-G so she can submit “Globus Universe” jobs to Condor
Condor-G Installation: Tell it what you need…
… and watch it go!
Frieda Submits a Globus Universe Job › In her submit description file, she specifies: Universe = Globus; which Globus gatekeeper to use; optionally, the location of the file containing her Globus proxy
universe        = globus
globusscheduler = beak.cs.wisc.edu/jobmanager
executable      = progname
queue
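A slightly fuller version of that submit file, as a sketch (the output/error/log lines are optional additions, and the proxy path shown is just the usual default location):
universe        = globus
globusscheduler = beak.cs.wisc.edu/jobmanager
executable      = progname
output          = progname.out
error           = progname.err
log             = progname.log
# Point Condor-G at the proxy if it is not in the default place
x509userproxy   = /tmp/x509up_u1234
queue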
How It Works (diagram, built up over several slides) › Frieda’s Personal Condor: the Schedd holds her 600 Globus-universe jobs › The Schedd starts a GridManager to manage them › The GridManager contacts the Globus Resource, whose gatekeeper starts a JobManager › The JobManager submits to the site’s local scheduler (LSF in the diagram), which runs the User Job
Condor Globus Universe
Globus Universe Concerns › What about Fault Tolerance? Local Crashes What if the submit machine goes down? Network Outages What if the connection to the remote Globus jobmanager is lost? Remote Crashes What if the remote Globus jobmanager crashes? What if the remote machine goes down?
New Fault Tolerance › Ability to restart a JobManager › Enhanced two-phase commit submit protocol › Donated by Condor project to Globus 2.0
Globus Universe Fault-Tolerance: Submit-side Failures › All relevant state for each submitted job is stored persistently in the Condor job queue › This persistent information allows the Condor GridManager, upon restart, to read the state and reconnect to the JobManagers that were running at the time of the crash › If a JobManager fails to respond…
Globus Universe Fault-Tolerance: Lost Contact with Remote Jobmanager › Can we reconnect to the jobmanager? Yes: the network was down, carry on. No: the machine crashed, the jobmanager crashed, or the job completed › Can we contact the gatekeeper? No: retry until we can talk to the gatekeeper again. Yes: the jobmanager crashed › Has the job completed? Yes: update the queue. No: the job is still running, so restart the jobmanager
Globus Universe Fault-Tolerance: Credential Management › Authentication in Globus is done with limited-lifetime X.509 proxies › A proxy may expire before jobs finish executing › Condor can put jobs on hold and ask the user to refresh the proxy (see the sketch below) › Todo: interface with MyProxy…
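In practice the refresh might look like this sketch (whether jobs go on hold automatically, and which jobs to release, depends on your Condor-G setup):
# Renew the proxy before (or after) it expires
grid-proxy-init -hours 24
# Release any jobs that were placed on hold while the proxy was stale
condor_release -all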
But Frieda Wants More… › She wants to run standard universe jobs on Globus-managed resources that aren’t running Condor For matchmaking and dynamic scheduling of jobs For job checkpointing and migration For remote system calls
Solution: Condor GlideIn › Frieda can use the Globus Universe to run Condor daemons on Globus resources › When the resources run these GlideIn jobs, they will temporarily join her Condor Pool › She can then submit Standard, Vanilla, PVM, or MPI Universe jobs and they will be matched and run on the Globus resources
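Condor provides a condor_glidein tool for this; a hypothetical invocation might look like the following, where the option name, count, and contact string are illustrative assumptions rather than exact syntax:
# Ask condor_glidein to start Condor daemons on 10 nodes of a remote
# Globus resource; while they run, those nodes join Frieda's pool
condor_glidein -count 10 beak.cs.wisc.edu/jobmanager-lsf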
How It Works (diagram, built up over several slides) › Frieda’s Personal Condor: the Schedd holds her 600 Condor jobs plus the GlideIn jobs, and her pool’s Collector is running › The Schedd starts a GridManager to manage the GlideIn (Globus-universe) jobs › The GridManager contacts the Globus Resource, whose gatekeeper starts a JobManager › The JobManager submits the GlideIn to the local scheduler (LSF in the diagram), which starts a Condor Startd on the remote node › The Startd reports to Frieda’s Collector and joins her pool › The Startd is matched with one of her jobs, and the User Job runs on the Globus resource
GlideIn Concerns › What if a Globus resource kills my GlideIn job? That resource will disappear from your pool and your jobs will be rescheduled on other machines Standard universe jobs will resume from their last checkpoint like usual › What if all my jobs are completed before a GlideIn job runs? If a GlideIn Condor daemon is not matched with a job in 10 minutes, it terminates, freeing the resource
What Have We Done on the Grid Already? › NUG30 › USCMS Testbed
NUG30 › quadratic assignment problem › 30 facilities, 30 locations: minimize the cost of transferring materials between them › posed as a challenge in 1968, long unsolved › but with a good pruning algorithm & high-throughput computing...
NUG30 Solved on the Grid with Condor + Globus › Resources simultaneously utilized: › the Origin 2000 (through LSF) at NCSA › the Chiba City Linux cluster at Argonne › the SGI Origin 2000 at Argonne › the main Condor pool at Wisconsin (600 processors) › the Condor pool at Georgia Tech (190 Linux boxes) › the Condor pool at UNM (40 processors) › the Condor pool at Columbia (16 processors) › the Condor pool at Northwestern (12 processors) › the Condor pool at NCSA (65 processors) › the Condor pool at INFN (200 processors)
NUG30—Number of Workers
NUG30 - Solved!!! Sender: Subject: Re: Let the festivities begin. Hi dear Condor Team, you all have been amazing. NUG30 required 10.9 years of Condor Time. In just seven days ! More stats tomorrow !!! We are off celebrating ! condor rules ! cheers, JP.
USCMS Testbed › Production of CMS data › Testbed has five sites across the US › Condor, Condor-G, Globus, GDMP… › A fantastic test environment for the Grid: the buck stops here! Issues encountered include errors between systems, logging, inetd confusing Globus, and the GASS cache tester
Questions? Comments? › Web: ›