Published by Dortha Lynch. Modified over 9 years ago.
1
What’s New in Condor-G
Jaime Frey, Computer Sciences Department, University of Wisconsin-Madison
jfrey@cs.wisc.edu
http://www.cs.wisc.edu/condor
2
www.cs.wisc.edu/condor
Outline
› What is Condor-G
› Released New Features
› In Development
3
What Is Condor-G
› Use Condor to run jobs on the Grid
› Uses the Globus Toolkit: GRAM (submit a remote job) and GASS (transfer the job’s files)
› Two components: Globus Universe and GlideIn
4
Globus Universe
› Run a job on a Grid resource
› Features: job management, fault tolerance, credential management
› Roughly equivalent to the vanilla universe
5
How It Works (diagram labels: Condor-G, Grid Resource, Schedd, LSF)
6
How It Works (diagram labels: Condor-G, Grid Resource, Schedd, LSF, 600 Globus jobs)
7
How It Works (diagram labels: Condor-G, Grid Resource, Schedd, GridManager, LSF, 600 Globus jobs)
8
How It Works (diagram labels: Condor-G, Grid Resource, Schedd, GridManager, JobManager, LSF, 600 Globus jobs)
9
How It Works (diagram labels: Condor-G, Grid Resource, Schedd, GridManager, JobManager, LSF, User Job, 600 Globus jobs)
10
GlideIn
› Run the Condor daemons on Grid resources as user jobs
› Create your own personal Condor pool from temporarily acquired Grid resources
› Brings the full power of Condor to the Grid
11
GlideIn (diagram labels: Condor-G, Globus Grid, PBS, LSF, Condor)
12
GlideIn (diagram labels: Condor-G, Globus Grid, PBS, LSF, Condor, 600 Condor jobs)
13
GlideIn (diagram labels: Condor-G, Globus Grid, PBS, LSF, Condor, 600 Condor jobs)
14
GlideIn (diagram labels: Condor-G, Globus Grid, PBS, LSF, Condor, glide-ins, 600 Condor jobs)
18
Released New Features
› Stuff we’ve added in the past year
› Released and ready for use in Condor 6.6
19
Globus ASCII Helper Protocol (GAHP)
› Encapsulates the Globus libraries in a separate process
› Simple ASCII protocol
› Easy for legacy applications to use Globus when they can’t link directly with the libraries
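To make the idea of a simple ASCII helper protocol concrete, here is a minimal sketch of a line-oriented request handler. It is a toy illustration only: the command names (SUBMIT, STATUS), the S/E reply prefixes, and the job-ID format are hypothetical and are not taken from the actual GAHP specification.

```python
# Toy line-oriented ASCII helper protocol, loosely in the spirit of GAHP.
# Command names and reply format are hypothetical, not the real GAHP commands.

def handle_line(line, jobs):
    """Parse one ASCII request line and return one ASCII reply line.

    `jobs` is a dict the helper uses to remember submitted jobs.
    """
    parts = line.strip().split(" ")
    if parts[0] == "":
        return "E empty-request"
    command, args = parts[0].upper(), parts[1:]
    if command == "SUBMIT" and len(args) == 2:
        request_id, resource = args
        job_id = f"job{len(jobs) + 1}"   # hypothetical job handle
        jobs[job_id] = resource
        return f"S {request_id} {job_id}"
    if command == "STATUS" and len(args) == 2:
        request_id, job_id = args
        if job_id in jobs:
            return f"S {request_id} PENDING"
        return f"E {request_id} unknown-job"
    return "E bad-request"
```

In a real deployment the helper would run as a separate process and the client would exchange such lines over its standard input and output, which is what lets an application use the libraries without linking against them.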
20
How It Works - GAHP (diagram labels: Condor-G, Grid Resources, Schedd, GridManager, GAHP Client, GAHP Server, JobManager, JobManager)
21
File Staging
› Arbitrary input and output files can be staged to and from the execution site
› Same syntax as other universes
› Limitation: output files must be explicitly named
22
File Staging (cont)
› Input, Output, and Error can be URLs; files will be transferred directly to and from the execution site
› Output and Error can be staged or streamed
23
Credential Refresh
› Renewed credentials are used by Condor-G and forwarded to the execution site automatically
› No processes need to be restarted
24
Better Credential Management
› One GridManager process can handle multiple credential files with the same subject
› More efficient when you want different credential lifetimes for different jobs
25
Grid Match-Making
› Globus jobs are matched with Globus resources by the Condor matchmaker using ClassAds
› Current limitation: user/admin must create the resource ads
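As a rough illustration of what the matchmaker does, the sketch below filters hand-written resource ads by a job's requirements and picks the one with the highest rank. It is a one-way simplification in Python, not the ClassAd language or the real matchmaker: the attribute names mirror those used in this talk's job ad example, while the hostnames and values are made up.

```python
# Simplified, one-way matchmaking sketch. Real ClassAd matching also
# evaluates the resource's own requirements against the job.

def match(job, resources):
    """Return the resource ad with the highest rank among those that
    satisfy the job's requirements, or None if nothing matches."""
    candidates = [ad for ad in resources if job["requirements"](ad)]
    if not candidates:
        return None
    return max(candidates, key=job["rank"])

# Hypothetical resource ads, written by hand as the slide says they must be.
resources = [
    {"gatekeeper_url": "gk1.example.org/jobmanager-pbs", "OpSys": "LINUX", "Mflops": 120},
    {"gatekeeper_url": "gk2.example.org/jobmanager-lsf", "OpSys": "LINUX", "Mflops": 300},
    {"gatekeeper_url": "gk3.example.org/jobmanager-lsf", "OpSys": "SOLARIS", "Mflops": 500},
]

job = {
    "requirements": lambda ad: ad["OpSys"] == "LINUX",
    "rank": lambda ad: ad["Mflops"],
}

best = match(job, resources)
```

Here the third resource is the fastest but fails the requirements, so the job would be sent to the second one.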
26
Fault Tolerance
› Condor-G does its best to automatically recover from failures
› User can guide decisions with job policy expressions: PeriodicRelease, GlobusResubmit, Rematch
27
PeriodicRelease Expression
› Condor-G puts problematic jobs on hold
› This expression tells Condor-G when to release and retry such jobs
28
GlobusResubmit Expression
› Tells Condor-G when a problematic job submission should be abandoned
› When this expression becomes true: best effort is made to clean up the current job submission, then a new job submission is attempted
29
Rematch Expression
› Tells Condor-G when a problematic resource should be abandoned
› Evaluated when GlobusResubmit evaluates to true
› When this expression becomes true: best effort is made to clean up the current job submission, then the job is rematched
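The way the three policy expressions interact can be sketched as a small decision function. This is an illustrative reading of the slides, not Condor-G's actual code: the function name, the action labels, and the choice to check the hold state first are assumptions, and the expressions are passed in as already-evaluated booleans.

```python
# Sketch of how the three policy expressions could steer recovery.
# Action names and evaluation order for held jobs are assumptions.

def next_action(held, periodic_release, globus_resubmit, rematch):
    """Pick the next recovery step for a job.

    held             -- job is currently on hold
    periodic_release -- value of the PeriodicRelease expression
    globus_resubmit  -- value of the GlobusResubmit expression
    rematch          -- value of the Rematch expression
    """
    if held:
        # PeriodicRelease decides when a held job is released and retried.
        return "release" if periodic_release else "stay_held"
    if globus_resubmit:
        # Rematch is only consulted once GlobusResubmit is true.
        if rematch:
            return "cleanup_and_rematch"
        return "cleanup_and_resubmit"
    return "continue"
```

For example, `next_action(False, False, True, True)` abandons both the submission and the resource, while the same call with `rematch=False` retries on the same resource.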
30
Job Ad Example
GlobusContactString = TARGET.gatekeeper_url
Requirements = TARGET.Arch == "LINUX" && TARGET.OpSys == "LINUX"
Rank = TARGET.Mflops
PeriodicRelease = ((NumMatches 600))
GlobusResubmit = NumSystemHolds >= NumMatches
Rematch = True
31
Hardening
› Regular testing on the CMS testbed with real applications
› Many bugs and integration issues found and fixed
› Hostile environment
32
Hostile Environment
› Full disks
› Machine crashes
› File server lock-ups
› Network outages
› Power outages
33
One CMS Dataset Run
› 300 jobs
› Last fall: ~50 (16%) of the jobs stalled and required human recovery; multiple service restarts (20 daemon crashes over 6 hours)
› Now: 0 jobs stalled, 0 service restarts
34
Integration Work
› Dozens of Condor-G improvements and bug fixes
› Over 40 Globus “bugzilla” incidents, many with patches; Globus 2.2.4 has 21 “Advisories” as of 4/11/04
› Use the latest version of both
35
Scalability
› Submitting several hundred jobs produced a high load on the server: the machine became unresponsive, and we saw a load average of 1000 at one point
› Caused by Globus JobManager processes
36
Grid Manager Monitor Agent
› New tool Condor-G can use to reduce this load
› Efficient job status polling program
› Allows Condor-G to shut down JobManager processes when they’re not needed
37
Load Reduced
› 400 jobs (/bin/sleep 900)
› Without Grid Monitor: 42 hours to complete, peak load average of 610
› With Grid Monitor: 40 minutes to complete, peak load average of 104
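The size of the improvement follows directly from the figures on this slide. The short calculation below only restates that arithmetic; the variable names are illustrative.

```python
# Ratios derived from the slide's numbers for the 400-job test.
without_monitor_minutes = 42 * 60   # 42 hours to complete
with_monitor_minutes = 40           # 40 minutes to complete

# How many times faster the run finished with the Grid Monitor.
speedup = without_monitor_minutes / with_monitor_minutes

# How many times lower the peak load average was.
load_ratio = 610 / 104

print(f"completion time: {speedup:.0f}x faster")
print(f"peak load: {load_ratio:.1f}x lower")
```

So the same workload finished about 63 times faster while the peak load average dropped by a factor of roughly 5.9.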
38
Miscellaneous Stuff
› Email notification on job completion
› Port range restrictions
› Problem jobs put on hold
39
In Development
› Stuff we’re currently working on
› Will be released sometime in the next year
40
Job Policy Expressions
› PeriodicHold
› PeriodicRemove
› OnExitHold
› OnExitRemove
41
Improved GlideIn
› MDS use optional: user specifies the necessary information
› Automatic setup: GlideIn job transfers and installs binaries if needed; binaries can come from the submit machine
42
New Job Types
› Submit jobs directly to other schedulers (not through Globus)
› Why? Richer interface semantics; not supported by Globus
43
NorduGrid
› Grid batch system designed by Nordic countries
› Globus GRAM didn’t offer the necessary semantics: client control of file staging, automatic cleanup of abandoned jobs
44
Oracle
› Oracle DBMS supports a job queue, e.g. “Run this query in 5 hours”, “Run this query every Monday”
› Condor can add more management features
45
Generic Job Interface
› Re-arrange GridManager to allow easy addition of new job types
› Define appropriate interface
› Plug-ins for new job types?
46
Globus Toolkit 3.0
› OGSA (Open Grid Services Architecture)
› Submit jobs to GT3 sites
› Grid Service client interface to Condor-G
47
Miscellaneous
› Condor-G for Windows
› MyProxy credential management
› URLs for executable, staged files
48
Thank You!
› Questions?
› Also: Condor-G & Globus Q/A session, Wednesday, 9am-12pm, room TBA
› E-mail condor-admin@cs.wisc.edu