Jaime Frey, Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison OGF.

1 Condor-G
Jaime Frey, Todd Tannenbaum
Computer Sciences Department, University of Wisconsin-Madison
{jfrey|tannenba}@cs.wisc.edu
http://www.cs.wisc.edu/condor
OGF 19 Condor Software Forum

2 What Is It?
Condor-G is a specialization of Condor. It is also known as the grid universe.
Condor-G speaks many different job management protocols.
Condor-G benefits from all the wonderful Condor features, like a real job queue.

3 Grid Fault-Tolerance
Condor-G does whatever it takes to run your jobs, even if:
    Your local machine crashes
    The grid service is temporarily unavailable
    The network goes down

4 Remote Resource Access: Globus
[Diagram: "globusrun myjob …" in Organization A speaks the Globus GRAM Protocol to a Globus JobManager in Organization B, which starts the job with fork().]

5 Globus
[Diagram: "globusrun myjob …" in Organization A speaks the Globus GRAM Protocol to a Globus JobManager in Organization B, which starts the job with fork().]

6 Globus + Condor
[Diagram: "globusrun myjob …" in Organization A speaks the Globus GRAM Protocol to a Globus JobManager in Organization B, which submits the job to a Condor Pool.]

7 Globus + Condor
[Diagram: "globusrun …" in Organization A speaks the Globus GRAM Protocol to a Globus JobManager in Organization B, which submits the job to a Condor Pool.]

8 Condor-G + Globus + Condor
[Diagram: Condor-G in Organization A holds a queue of jobs (myjob1, myjob2, myjob3, myjob4, myjob5, …) and speaks the Globus GRAM Protocol to a Globus JobManager in Organization B, which submits them to a Condor Pool.]

9 Condor-G Fault-Tolerance: Lost Contact with Remote Jobmanager
[Flowchart]
Can we contact gatekeeper?
    Yes - jobmanager crashed
    No - retry until we can talk to gatekeeper again…
Can we reconnect to jobmanager?
    Yes - network was down
    No - machine crashed or job completed
Restart jobmanager
Has job completed?
    Yes - update queue
    No - is job still running?
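The recovery flow on this slide can be sketched in Python. This is only one plausible reading of the flattened flowchart, not Condor-G's actual gridmanager code: the function name, the three probe callbacks, the retry limit, and the action strings are all hypothetical.

```python
from typing import Callable, List


def recover_lost_jobmanager(
    can_contact_gatekeeper: Callable[[], bool],
    can_reconnect_jobmanager: Callable[[], bool],
    job_completed: Callable[[], bool],
    max_retries: int = 5,
) -> List[str]:
    """Return the actions taken after losing contact with a jobmanager.

    Assumed reading of the slide: a reachable gatekeeper means the
    jobmanager itself crashed; an unreachable one means we retry first.
    """
    actions: List[str] = []
    if can_contact_gatekeeper():
        # Gatekeeper answers but the jobmanager does not: it crashed.
        actions.append("restart jobmanager")
    else:
        # A real system would sleep between attempts; omitted here.
        retries = 0
        while not can_contact_gatekeeper():
            retries += 1
            if retries >= max_retries:
                actions.append("still unreachable; retry later")
                return actions
        if can_reconnect_jobmanager():
            # The jobmanager survived, so only the network was down.
            actions.append("network was down; resume monitoring")
            return actions
        # Machine crashed or the job completed while we were cut off.
        actions.append("restart jobmanager")
    if job_completed():
        actions.append("update queue")
    else:
        actions.append("job still running; keep monitoring")
    return actions
```

Passing the probes in as callbacks keeps the decision logic separate from any real network code, so the branches can be exercised with canned answers.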

10 Just to be fair…
The gatekeeper doesn't have to submit to a Condor pool. It could be PBS, LSF, Sun Grid Engine…
Condor-G will work fine whatever the remote batch system is.

11 Other Condor-G Features
Other Grid Protocols
    Works with WS-GRAM, NorduGrid, Unicore
Credential Management
    Pull refreshed credentials from MyProxy
    Push refreshed credentials to remote systems
Job Scheduling
    Use Matchmaking to select resources for jobs
GlideIn
    Allows late binding of resources and job checkpoint/migration

12 Condor-G
[Diagram: a Job Description (Job ClassAd) feeds Condor-G. Other labels: GT2 [.1|2|4], HTTPS, Condor, PBS/LSF, NorduGrid, GT4, WSRF, Unicore.]

13 Pre-WS GRAM Submit file
    grid_resource = gt2 \
    foo.edu/jobmanager-pbs
    globus_rsl = (queue=long)\
    (condor_submit=(universe java))

14 OGSA GRAM Submit file (Museum mode)
    grid_resource = gt3 http://foo.edu/\
    ogsa/services/base/gram/\
    PBSManagedJobFactoryService
    globus_rsl = (queue=long)\
    (condor_submit=(universe java))

15 WS GRAM Submit file
    grid_resource = gt4 foo.edu PBS
    globus_xml = <queue>long</queue>

16 NorduGrid Submit file
    grid_resource = nordugrid foo.edu
    nordugrid_rsl = (queue=long)

17 Unicore Submit file
    grid_resource = unicore usite.org vsite
    keystore_file = keystore
    keystore_passphrase_file = keystore.pw
    keystore_alias = my cert

18 Condor Submit file
    grid_resource = condor schedd.foo.edu \
    cm.foo.edu
    remote_universe = java

19 PBS Submit file
    grid_resource = pbs

20 LSF Submit file
    grid_resource = lsf

21 Grid Universe Fault-Tolerance: Credential Management
Authentication in many grid protocols is done with limited-lifetime X.509 proxies
Proxy may expire before jobs finish executing
Condor can put jobs on hold and email user to refresh proxy
Condor can automatically retrieve new proxies from MyProxy
When the proxy is refreshed, Condor forwards it to the jobs

22 MyProxy Submit file
    MyProxyHost = foo.edu:12345
    MyProxyServerDN = /DC=org/DC=doegrids…
    MyProxyCredentialName = proxy_file
    MyProxyRefreshThreshold = 240 #mins
    MyProxyNewProxyLifetime = 12 #hrs
    MyProxyPassword = password
Or give password on command line
    condor_submit -p password submit.desc
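The credential-management behaviour described on the last two slides might be sketched as a small decision function. This is an illustration under assumptions, not Condor's implementation: the function name, its parameters, and the returned action strings are hypothetical, and the 240-minute threshold simply follows the slide's "#mins" annotation.

```python
from datetime import timedelta


def proxy_action(
    time_left: timedelta,
    refresh_threshold: timedelta,
    myproxy_configured: bool,
) -> str:
    """Decide what to do about a job's proxy (hypothetical helper)."""
    if time_left > refresh_threshold:
        # Proxy still has comfortable lifetime; leave it alone.
        return "nothing to do"
    if myproxy_configured:
        # Automatic path: pull a fresh proxy, then push it to the jobs.
        return "retrieve new proxy from MyProxy and forward it to the jobs"
    # Manual path: the user must refresh the proxy by hand.
    return "hold jobs and email user"


# Threshold taken from the slide's example value (240, annotated as minutes).
THRESHOLD = timedelta(minutes=240)
```

For example, with three hours left on the proxy and MyProxy configured, `proxy_action(timedelta(hours=3), THRESHOLD, True)` takes the automatic refresh path.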

23 Condor-G Matchmaking
Use Condor-G matchmaking with grid universe jobs
Allows Condor-G to dynamically assign computing jobs to grid sites
An example of lazy planning

24 Condor-G Matchmaking, cont.
Normally a grid universe job must specify the site in the submit description file via the grid_resource attribute, like so:
    Executable = foo
    Universe = grid
    Grid_Resource = gt2 \
    beak.cs.wisc.edu/jobmanager-pbs
    queue

25 Condor-G Matchmaking, cont.
With matchmaking, grid universe jobs can use requirements and rank:
    Executable = foo
    Universe = grid
    Grid_Resource = $$(ResourceName)
    Requirements = arch == LINUX
    Rank = NumberOfNodes * random()
    Queue
The $$(x) syntax inserts information from the target ClassAd when a match is made.
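To make the idea concrete, here is a toy Python model of requirements, rank, and $$() substitution. It is a sketch, not Condor's matchmaker: real matchmaking evaluates ClassAd expressions and also checks the resource's own Requirements against the job. The `match` and `expand` helpers, the dictionary-based ads, the `Arch` attribute, and the example host names are all assumptions made for illustration.

```python
import re
from typing import Callable, Dict, List, Optional

Ad = Dict[str, object]


def match(
    requirements: Callable[[Ad], bool],
    rank: Callable[[Ad], float],
    resource_ads: List[Ad],
) -> Optional[Ad]:
    """Pick the highest-ranked resource ad that satisfies the requirements."""
    candidates = [ad for ad in resource_ads if requirements(ad)]
    if not candidates:
        return None
    return max(candidates, key=rank)


def expand(template: str, target: Ad) -> str:
    """Replace each $$(Name) with that attribute from the matched ad."""
    return re.sub(
        r"\$\$\((\w+)\)",
        lambda m: str(target[m.group(1)]),
        template,
    )


# Hypothetical resource ads; only the attributes used below are included.
ADS: List[Ad] = [
    {"ResourceName": "gt2 a.example/jobmanager-pbs", "Arch": "LINUX", "NumberOfNodes": 10},
    {"ResourceName": "gt4 b.example PBS", "Arch": "LINUX", "NumberOfNodes": 300},
    {"ResourceName": "gt2 c.example/jobmanager-lsf", "Arch": "OTHER", "NumberOfNodes": 1000},
]
```

The slide's rank multiplies NumberOfNodes by random(), which spreads jobs across sites while favoring larger ones; a deterministic rank such as `lambda ad: ad["NumberOfNodes"]` is easier to reason about when experimenting with the sketch.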

26 Condor-G Matchmaking, cont.
Where do these target ClassAds representing Globus gatekeepers come from? Several options:
    Simple script on gatekeeper publishes an ad via condor_advertise command-line utility (method used by D0 JIM, USCMS)
    Program to query Globus MDS and convert information into ClassAd (method used by EDG)
    Run HawkEye with appropriate plugins on the gatekeeper
For explanation of Condor-G matchmaking setup for USCMS, see http://www.cs.wisc.edu/condor/USCMS_matchmaking.html

27 Condor-G Matchmaking: Creating the Resource Ad
Machine Ad
    MyType = Machine
    TargetType = Job
    Name = foo.edu
    Machine = foo.edu
    ResourceName = gt4 foo.edu PBS
    UpdateSequenceNumber = 4
    Requirements = TARGET.JobUniverse == 9 && \
    CurMatches < 10
    CurMatches = 0
    NumberOfNodes = 300
    Rank = 0.0
    CurrentRank = 0.0
    WantAdRevaluate = True

28 Condor-G Matchmaking: Creating the Resource Ad
Advertising a resource
    condor_advertise UPDATE_STARTD_AD \
    ad-file
Call periodically
Use unix time for UpdateSequenceNumber
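A script that regenerates the ad file before each condor_advertise call could look roughly like the following Python sketch. The function name and parameters are hypothetical; the attribute list mirrors the Machine Ad slide, with string values double-quoted as ClassAd syntax expects (the slide shows them bare) and the Requirements expression written on one line.

```python
import time
from typing import Optional


def render_resource_ad(
    name: str,
    resource_name: str,
    nodes: int,
    now: Optional[float] = None,
) -> str:
    """Build the text of a resource ad, stamped with the current Unix time."""
    # Unix time grows with every call, so each update gets a newer sequence number.
    seq = int(time.time() if now is None else now)
    lines = [
        'MyType = "Machine"',
        'TargetType = "Job"',
        f'Name = "{name}"',
        f'Machine = "{name}"',
        f'ResourceName = "{resource_name}"',
        f"UpdateSequenceNumber = {seq}",
        "Requirements = TARGET.JobUniverse == 9 && CurMatches < 10",
        "CurMatches = 0",
        f"NumberOfNodes = {nodes}",
        "Rank = 0.0",
        "CurrentRank = 0.0",
        "WantAdRevaluate = True",
    ]
    return "\n".join(lines) + "\n"
```

The output would be written to the ad file and published with the condor_advertise command shown on the slide, for instance from a cron job so that the ad is refreshed periodically.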

29 But Wait, There's More…
What if you want to run standard universe jobs on grid resources?
    For matchmaking and dynamic scheduling of jobs
    For job checkpointing and migration
    For remote system calls
What if you don't want to send a job to a site until the moment the job will start running (late binding)?

30 One Solution: Condor-G GlideIn
You can use the Grid Universe to run Condor daemons on grid resources
When the resources run these GlideIn jobs, they will temporarily join your Condor Pool
You can then submit Standard, Vanilla, PVM, or MPI Universe jobs and they will be matched and run on the grid resources

31 [Diagram labels: your workstation, personal Condor, 600 Condor jobs, Friendly Condor Pool, Condor Pool, glide-in jobs, Globus Grid (PBS, LSF, Condor).]

32 GlideIn Concerns
What if a grid resource kills my GlideIn job?
    That resource will disappear from your pool and your jobs will be rescheduled on other machines
    Standard universe jobs will resume from their last checkpoint like usual
What if all my jobs are completed before a GlideIn job runs?
    If a GlideIn Condor daemon is not matched with a job in 10 minutes, it terminates, freeing the resource
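The 10-minute idle rule can be illustrated with a few lines of Python. This is a sketch of the idea only: the class name and methods are hypothetical, times are plain seconds supplied by the caller, and the real GlideIn daemons are configured rather than coded this way.

```python
class GlideInIdleTimer:
    """Track how long a glided-in daemon has gone without a matched job."""

    def __init__(self, started_at: float, idle_limit: float = 600.0) -> None:
        # 600 seconds corresponds to the 10 minutes quoted on the slide.
        self.last_busy = started_at
        self.idle_limit = idle_limit

    def job_matched(self, now: float) -> None:
        """Record that a job was matched, resetting the idle clock."""
        self.last_busy = now

    def should_exit(self, now: float) -> bool:
        """True once the daemon has been unmatched for the whole limit."""
        return now - self.last_busy >= self.idle_limit
```

A daemon that exits on its own like this frees the grid resource without requiring the submitter to cancel leftover GlideIn jobs.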

33 Condor
[Diagram: condor_submit hands the job to the schedd (Job caretaker); the matchmaker pairs it with a Startd (Runs job).]

34 Condor-G
[Diagram: condor_submit -> schedd (Job caretaker) -> gridmanager -> gahp -> Globus gatekeeper -> PBS or LSF]

35 Condor-C
[Diagram: condor_submit -> schedd (Job caretaker) -> gridmanager -> condor-gahp -> schedd, with matchmaker and startd on the remote side]

36 Condor-C to non-Condor
[Diagram: condor_submit -> schedd (Job caretaker) -> gridmanager -> condor-gahp -> schedd -> gridmanager -> pbs/lsf-gahp -> PBS or LSF]

37 Gliding in Condor-C
[Diagram labels: condor_submit, schedd (Job caretaker), gridmanager, pbs/lsf-gahp, PBS or LSF, condor-gahp, gahp, Globus gatekeeper, schedd. Steps: 1. Glide-in 2. Submit jobs]

38 Matchmaking with Condor-C
In all of these examples, Condor-C went to a specific remote schedd
This is not required: you can do matchmaking

39 Matchmaking with Condor-C
[Diagram labels: condor_submit, schedd (Job caretaker), gridmanager, condor-gahp, matchmaker, schedd, …, submit job]
