1
Using Condor on the Grid
Alain Roy
Computer Sciences Department, University of Wisconsin-Madison
roy@cs.wisc.edu
http://www.cs.wisc.edu/condor
25 June 2002
2
Good evening!
› Thank you for having me!
› I am: Alain Roy
  Computer Science Ph.D. in Quality of Service, with the Globus Project
  Now working with the Condor Project
› This is the last of three Condor tutorials
3
Review: What is Condor?
› Condor converts collections of distributively owned workstations and dedicated clusters into a distributed high-throughput computing facility:
  run lots of jobs over a long period of time,
  not a short burst of "high-performance" computing.
› Condor manages both machines and jobs with ClassAd matchmaking to keep everyone happy (see the sketch below).
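To make the matchmaking idea concrete, here is a minimal sketch of the job side of a match, written as a Condor submit description file. The attribute values are illustrative assumptions, not from the slides; Condor pairs such requirements against the OpSys, Arch, and Memory attributes that machines advertise in their own ClassAds.

  # Hypothetical job-side ClassAd expressions in a submit file.
  universe     = standard
  executable   = my_simulation
  requirements = (OpSys == "LINUX") && (Arch == "INTEL") && (Memory >= 256)
  rank         = KFlops
  queue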
4
Condor Takes Care of You
› Condor does whatever it takes to run your jobs, even if some machines…
  crash (or are disconnected)
  run out of disk space
  don't have your software installed
  are frequently needed by others
  are far away and managed by someone else
5
What is Unique about Condor?
› ClassAds
› Transparent checkpoint/restart
› Remote system calls
› Works in heterogeneous clusters
› Clusters can be:
  dedicated
  opportunistic
6
What's Condor Good For?
› Managing a large number of jobs
› Robustness:
  checkpointing
  a persistent job queue
› Ability to access more resources
› Flexible policies to control usage of your pool
7
A Bit of Condor Philosophy
› Condor brings more computing to everyone:
  A small-time scientist can make an opportunistic pool with 10 machines and get 10 times as much computing done.
  A large collaboration can use Condor to control its dedicated pool with hundreds of machines.
8
Condor's Idea
Computing power is everywhere; we try to make it usable by anyone.
9
Condor and the Grid
› The Grid provides: uniform, dependable, consistent, pervasive, and inexpensive computing. Hopefully.
› Condor wants to make computing power usable by everyone.
10
This Must Be a Match Made in Heaven!
[Image: Condor + Globus logos]
11
Remember Frieda?
Today we'll revisit Frieda's Condor/Grid explorations in more depth.
12
First, a Review of Globus
› Globus isn't "The Grid", but it provides a lot of commonly used technologies for building Grids.
› Globus is a toolkit: pick the pieces you wish to use.
› Globus implements standard Grid protocols and APIs.
13
Globus Toolkit Pieces
› Security: Grid Security Infrastructure (GSI)
› Resource management: GRAM
  submit and monitor jobs
› Information services
› Data transfer: GridFTP
14
Grid Security Infrastructure
› Authentication and authorization
› Certificate authorities
› Single sign-on (see the sketch below)
› Usually public-key authentication
› Can work with Kerberos
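As an illustration of single sign-on: a user typically creates one short-lived proxy credential and then runs any number of Grid commands against it without retyping a passphrase. A minimal sketch using the standard GSI command-line tools:

  # Create a limited-lifetime proxy from your long-lived certificate
  # (prompts once for your private-key passphrase).
  grid-proxy-init -hours 12

  # Inspect the proxy: subject, issuer, and time remaining.
  grid-proxy-info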
15
Resource Management
› A single method for submitting jobs
› Multiple back-ends for running jobs (see the sketch below):
  fork
  Condor
  PBS/LSF/…
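A hedged example of that single interface: the same job description (in Globus RSL) can be handed to different back-ends just by naming a different job manager at the gatekeeper. The hostname below is hypothetical.

  # Run a job through the default (fork) job manager,
  # streaming output back to the client.
  globusrun -o -r gatekeeper.example.edu "&(executable=/bin/date)"

  # Same job, handed to the site's Condor back-end instead.
  globusrun -o -r gatekeeper.example.edu/jobmanager-condor "&(executable=/bin/date)"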
16
Information Services
› LDAP-based:
  easy to access with standard clients (see the sketch below)
› Implements standard schemas for representing resources
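Because the information service (MDS) is LDAP-based, a stock LDAP client is all you need to query it. A sketch, assuming a server on the conventional MDS port 2135; the hostname is hypothetical and the filter is one common example from the MDS schema:

  # Query a Grid information server with a plain LDAP client.
  ldapsearch -x -h giis.example.edu -p 2135 \
      -b "mds-vo-name=local, o=grid" "(objectclass=MdsHost)"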
17
Data Transfer
› GridFTP:
  uses GSI authentication
  high performance through parallel and striped transfers (see the sketch below)
  quickly becoming widely used
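Parallel streams are requested from the client side; a sketch using the standard globus-url-copy client, with hypothetical hosts and paths:

  # GSI-authenticated transfer using 4 parallel TCP streams.
  globus-url-copy -p 4 \
      gsiftp://source.example.edu/data/run01.dat \
      file:///scratch/run01.dat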
18
Where does Condor Fit In?
› Condor back-end for GRAM:
  submit Globus jobs; they run in your Condor pool
› Condor-G: submit to Globus resources
  provides reliability and monitoring beyond standard Globus mechanisms
› The two can be used together!
› We'll describe both of these.
19
Condor back-end for GRAM
› GRAM uses a job manager to control jobs:
  Globus comes with a Condor job manager,
  easy to configure with setup-globus-gram-jobmanager.
› Users can configure Condor behavior with RSL when submitting jobs (see the sketch below):
  jobtype: configures the universe (vanilla/standard)
› The job manager constructs a Condor submit file and submits it to the Condor pool.
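A sketch of what that RSL control looks like from the submitting side. The mapping of jobtype values to universes follows my reading of the GT2 Condor job manager (jobtype=condor selecting the standard universe); treat it as an assumption, and the hostname is hypothetical.

  # Submit 10 standard-universe jobs through the Condor job manager.
  globusrun -o -r gatekeeper.example.edu/jobmanager-condor \
      "&(executable=my_simulation)(jobtype=condor)(count=10)"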
20
"I have 600 simulations to run. Where can I get help?"
21
Frieda…
› Installed a personal Condor
› Made a larger Condor pool
› Added dedicated nodes
› Added Grid resources
› We talked about the first three steps in detail earlier.
22
Frieda Goes to the Grid!
› First, Frieda takes advantage of her Condor friends!
› She knows people with their own Condor pools, and gets permission to access their resources.
› She then configures her Condor pool to "flock" to these pools.
23
www.cs.wisc.edu/condor your workstation Friendly Condor Pool personal Condor 600 Condor jobs Condor Pool
24
How Flocking Works
› Add a line to your condor_config (a fuller sketch follows below):
    FLOCK_TO = Friendly-Pool
  and, on the friendly pool's side:
    FLOCK_FROM = Friedas-Pool
[Diagram: the submit machine's schedd normally talks to its own central manager (CONDOR_HOST, running a collector and negotiator); when flocking, it also talks to the collector and negotiator on the Friendly-Pool central manager.]
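A slightly fuller sketch of both sides of the configuration. The hostnames are hypothetical, and the layout follows common Condor flocking setups rather than the slide itself:

  # --- On Frieda's submit machine (condor_config) ---
  # Pools to flock to, contacted in order when local machines are busy.
  FLOCK_TO = condor.friendly-pool.example.edu

  # --- On the Friendly-Pool central manager (condor_config) ---
  # Schedds that are allowed to flock in.
  FLOCK_FROM = submit.friedas-pool.example.edu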
25
Condor Flocking
› Remote pools are contacted in the order specified until the jobs are satisfied.
› The list of remote pools is a property of the schedd, not the central manager:
  different users can flock to different pools;
  remote pools can allow specific users.
› The user-priority system is "flocking-aware":
  a pool's local users can have priority over remote users "flocking" in.
26
Condor Flocking, cont.
› Flocking is Condor-specific technology…
› Frieda also has access to Globus resources she wants to use:
  she has certificates and access to Globus gatekeepers at remote institutions.
› But Frieda wants Condor's queue-management features for her Globus jobs!
› She installs Condor-G so she can submit "Globus universe" jobs to Condor.
27
Condor-G Installation: Tell it what you need…
[Screenshot: the Condor-G installation script prompting for configuration options]
28
… and watch it go!
[Screenshot: the Condor-G installation running]
29
Frieda Submits a Globus Universe Job
› In her submit description file, she specifies:
  universe = globus,
  which Globus gatekeeper to use,
  optionally, the location of the file containing her Globus certificate.

  universe        = globus
  globusscheduler = beak.cs.wisc.edu/jobmanager
  executable      = progname
  queue
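A fuller sketch of such a submit file, with logging and an explicit proxy location added. The extra attributes are ordinary Condor-G usage shown here as assumptions, not part of the slide, and the proxy path is hypothetical:

  universe        = globus
  globusscheduler = beak.cs.wisc.edu/jobmanager
  executable      = progname
  output          = progname.out
  error           = progname.err
  log             = progname.log
  x509userproxy   = /tmp/x509up_u1234
  queue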
30
How It Works
[Diagram sequence, slides 30-34: Frieda's personal Condor schedd holds 600 Globus jobs; the schedd starts a GridManager; the GridManager contacts the Globus resource, whose gatekeeper starts a JobManager; the JobManager hands the work to the site's local LSF system, which runs the user job.]
35
Condor Globus Universe
36
Globus Universe Concerns
› What about fault tolerance?
  Local crashes: what if the submit machine goes down?
  Network outages: what if the connection to the remote Globus jobmanager is lost?
  Remote crashes: what if the remote Globus jobmanager crashes? What if the remote machine goes down?
37
New Fault Tolerance
› Ability to restart a JobManager
› Enhanced two-phase-commit submit protocol
› Donated by the Condor project to Globus 2.0
38
Globus Universe Fault-Tolerance: Submit-Side Failures
› All relevant state for each submitted job is stored persistently in the Condor job queue.
› This persistent information allows the Condor GridManager, upon restart, to read the state information and reconnect to the JobManagers that were running at the time of the crash.
› And if a JobManager fails to respond…
39
Globus Universe Fault-Tolerance: Lost Contact with Remote Jobmanager
› Can we reconnect to the jobmanager?
  Yes: the network was down; carry on as before.
  No: the machine crashed, the jobmanager crashed, or the job completed.
› Can we contact the gatekeeper?
  No: retry until we can talk to the gatekeeper again…
  Yes: the jobmanager crashed, so restart the jobmanager.
› Has the job completed?
  Yes: update the queue.
  No: the job is still running; the restarted jobmanager resumes managing it.
40
Globus Universe Fault-Tolerance: Credential Management
› Authentication in Globus is done with limited-lifetime X.509 proxies.
› A proxy may expire before the jobs finish executing.
› Condor can put jobs on hold and email the user to refresh the proxy (see the sketch below).
› To do: interface with MyProxy…
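A sketch of that refresh cycle from the user's side; the job ID is hypothetical, and the hold/release behavior is as the slide describes:

  # The proxy expired and Condor-G put the job on hold.
  # Create a fresh proxy...
  grid-proxy-init -hours 24

  # ...then release the held job so it can continue.
  condor_release 42.0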
41
But Frieda Wants More…
› She wants to run standard universe jobs on Globus-managed resources that aren't running Condor:
  for matchmaking and dynamic scheduling of jobs,
  for job checkpointing and migration,
  for remote system calls.
42
Solution: Condor GlideIn
› Frieda can use the Globus universe to run Condor daemons on Globus resources (see the sketch below).
› When the resources run these GlideIn jobs, they temporarily join her Condor pool.
› She can then submit Standard, Vanilla, PVM, or MPI universe jobs, and they will be matched and run on the Globus resources.
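Condor ships a helper command for this. A hedged sketch of its use; the gatekeeper name and count are hypothetical, and the exact options vary across Condor versions:

  # Glide 10 Condor daemons onto a Globus resource; they report
  # back to Frieda's pool and appear there as temporary machines.
  condor_glidein -count 10 gatekeeper.example.edu/jobmanager-fork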
43
How It Works
[Diagram sequence, slides 43-50: Frieda's schedd holds 600 Condor jobs plus the GlideIn jobs; the GridManager submits the GlideIn jobs through the Globus resource's JobManager to LSF; each GlideIn job starts a Condor startd, which reports to Frieda's collector; her jobs then match against the glided-in machines, and the user jobs run on the Globus resource as if it were part of her pool.]
51
GlideIn Concerns
› What if a Globus resource kills my GlideIn job?
  That resource will disappear from your pool, and your jobs will be rescheduled on other machines.
  Standard universe jobs will resume from their last checkpoint, as usual.
› What if all my jobs are completed before a GlideIn job runs?
  If a GlideIn Condor daemon is not matched with a job within 10 minutes, it terminates, freeing the resource (see the sketch below).
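The 10-minute self-termination corresponds to an idle timeout on the glided-in startd. A hedged sketch using the Condor configuration knob commonly associated with this behavior; whether the 2002 glide-in scripts used this exact macro is an assumption:

  # condor_config fragment for a glided-in startd: exit if no
  # claim is established within 10 minutes (600 seconds).
  STARTD_NOCLAIM_SHUTDOWN = 600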
52
What Have We Done on the Grid Already?
› NUG30
› USCMS Testbed
53
NUG30
› A quadratic assignment problem:
  30 facilities, 30 locations;
  minimize the cost of transferring materials between them.
› Posed as a challenge in 1968; long unsolved.
› But with a good pruning algorithm and high-throughput computing...
54
NUG30 Solved on the Grid with Condor + Globus
Resources simultaneously utilized:
› the SGI Origin 2000 (through LSF) at NCSA
› the Chiba City Linux cluster at Argonne
› the SGI Origin 2000 at Argonne
› the main Condor pool at Wisconsin (600 processors)
› the Condor pool at Georgia Tech (190 Linux boxes)
› the Condor pool at UNM (40 processors)
› the Condor pool at Columbia (16 processors)
› the Condor pool at Northwestern (12 processors)
› the Condor pool at NCSA (65 processors)
› the Condor pool at INFN (200 processors)
55
NUG30: Number of Workers
[Chart: the number of worker machines over the course of the NUG30 run]
56
NUG30 - Solved!!!
  Sender: goux@dantec.ece.nwu.edu
  Subject: Re: Let the festivities begin.

  Hi dear Condor Team,
  You all have been amazing. NUG30 required 10.9 years of Condor time. In just seven days!
  More stats tomorrow!!! We are off celebrating!
  condor rules!
  cheers, JP.
57
USCMS Testbed
› Production of CMS data
› The testbed has five sites across the US
› Condor, Condor-G, Globus, GDMP…
› A fantastic test environment for the Grid: the buck stops here!
  errors between systems, logging
  inetd confuses Globus
  GASS cache tester
58
Questions? Comments?
› Web: www.cs.wisc.edu/condor
› Email: condor-admin@cs.wisc.edu