Condor Project Computer Sciences Department University of Wisconsin-Madison Grids and Condor Barcelona, 2006
2 Agenda Extended user’s tutorial Advanced Uses of Condor Java programs DAGMan Stork MW Grid Computing Case studies, and a discussion of your application’s needs
3 Resources There are many resources (machines) in the world, and many are or can be made available! Groups of machines may be labeled as grids. Welcome to the power of the grid!
4 Condor and Grids Condor has always been a tool to harness grid computing. Condor’s mechanisms have evolved as technologies have evolved. Roughly categorized: Flocking, Glidein, and the grid universe
5 Flocking A way for jobs to run within a different, separate Condor pool. Condor runs here, and Condor runs there
6 Connect Condor Pools with Flocking Flocking is a Condor-specific technology Flocking is enabled with configuration Jobs flock from here to there when they cannot be run here due to lack of available machines
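The flocking rule on this slide can be sketched as a small decision function. This is only an illustration of the idea, not how Condor implements it: the function name `choose_pool`, the pool names, and the assumption that flocking targets are tried in the order listed are all hypothetical.

```python
from typing import Dict, List, Optional


def choose_pool(idle_machines: Dict[str, int], local_pool: str,
                flock_to: List[str]) -> Optional[str]:
    """Return the pool a job would run in, or None if it must keep waiting."""
    # Prefer the local pool whenever it has an available machine.
    if idle_machines.get(local_pool, 0) > 0:
        return local_pool
    # Otherwise try the flocking targets in the order they are listed
    # (an assumption made for this sketch).
    for pool in flock_to:
        if idle_machines.get(pool, 0) > 0:
            return pool
    # No machine is available anywhere: the job stays idle.
    return None
```

A job leaves "here" only when "here" has nothing to offer, which is the behavior the slide describes.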
7 Configuration Configuration files contain lots of the administrative information used by Condor Format is like that in submit description files: AttributeName = Value
8 Configuration here
For jobs to be able to flock from here to there, in the configuration file on the pool where jobs flock from:
    FLOCK_TO =
    FLOCK_COLLECTOR_HOSTS = $(FLOCK_TO)
    FLOCK_NEGOTIATOR_HOSTS = $(FLOCK_TO)
    HOSTALLOW_NEGOTIATOR_SCHEDD = $(COLLECTOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
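To see how the `$(NAME)` references in these settings resolve, here is a minimal sketch of an `AttributeName = Value` parser with macro expansion. It is a simplification for illustration, not Condor's own parser: the host names `cm.here.example.org` and `cm.there.example.org` are hypothetical, and details such as case-insensitive names are ignored.

```python
import re
from typing import Dict

# Matches a macro reference such as $(FLOCK_TO).
MACRO = re.compile(r"\$\(([A-Za-z_][A-Za-z0-9_]*)\)")

# Hypothetical values; the real host names depend on the pools involved.
EXAMPLE = """
COLLECTOR_HOST = cm.here.example.org
FLOCK_TO = cm.there.example.org
FLOCK_COLLECTOR_HOSTS = $(FLOCK_TO)
FLOCK_NEGOTIATOR_HOSTS = $(FLOCK_TO)
HOSTALLOW_NEGOTIATOR_SCHEDD = $(COLLECTOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
"""


def parse_config(text: str) -> Dict[str, str]:
    """Collect AttributeName = Value pairs, skipping blanks and # comments."""
    raw: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        raw[name.strip()] = value.strip()
    return raw


def expand(name: str, config: Dict[str, str], depth: int = 0) -> str:
    """Return the value of name with every $(OTHER) reference substituted."""
    if depth > 20:
        raise ValueError("macro nesting too deep (possible cycle)")
    value = config.get(name, "")
    return MACRO.sub(lambda m: expand(m.group(1), config, depth + 1), value)


config = parse_config(EXAMPLE)
allowed = expand("HOSTALLOW_NEGOTIATOR_SCHEDD", config)
```

With the hypothetical values above, `allowed` names both the local collector and the remote negotiator, which is what lets the remote pool's negotiator talk to the local schedd.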
10 Submit Description File
Enable file transfer:
    universe = vanilla
    executable = myjob.exe
    input = myjob.input
    output = myjob.output
    log = myjob.log
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT
    queue
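When many similar jobs are needed, a submit description file like this one can be generated rather than typed. The helper below is a hypothetical convenience, not part of Condor; it only writes `name = value` lines followed by a `queue` statement.

```python
from typing import Dict


def render_submit(attributes: Dict[str, str], queue_count: int = 1) -> str:
    """Render attributes as submit description text ending in a queue line."""
    lines = [f"{name} = {value}" for name, value in attributes.items()]
    # A bare "queue" submits one job; "queue N" submits N jobs.
    lines.append("queue" if queue_count == 1 else f"queue {queue_count}")
    return "\n".join(lines) + "\n"


# The attributes from the slide, in the same order.
job = {
    "universe": "vanilla",
    "executable": "myjob.exe",
    "input": "myjob.input",
    "output": "myjob.output",
    "log": "myjob.log",
    "should_transfer_files": "YES",
    "when_to_transfer_output": "ON_EXIT",
}

submit_text = render_submit(job)
```

The resulting text could be written to a file and handed to condor_submit.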
11 The Glidein Concept Assume: We need more machines, and we have permission to use a set of machines Glidein temporarily adds a set of machines to the local pool
12 Glidein In addition, Glidein solves the problem: “My job needs to run on that particular resource, and my job needs Condor.” For example: a job that must run under the standard universe
13 Glidein Condor sends and runs its own executables on the resource. The needed resource appears to temporarily join the local Condor pool!
14 Glidein Run condor_glidein to add the remote resource to the local pool. The master and startd daemons become grid universe jobs using gt2
15 Making Glidein Work
Change the configuration to give access permission (HOSTALLOW_WRITE) to the remote resource
No changes to jobs’ submit description files!
But, do enable file transfer in the submit description file:
    universe = vanilla
    executable = myjob.exe
    input = myjob.input
    output = myjob.output
    log = myjob.log
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT
    queue
16 Force Job to Glidein Resource
In the submit description file:
    universe = standard
    executable = ajob.exe
    input = ajob.input
    output = ajob.output
    log = ajob.log
    requirements = \
        ( machine == "" ) \
        && Arch != "" && OpSys != ""
    queue
17 The Grid Universe
Most useful when
1. We want to send a job off to a far away machine
2. We want to hand a job to another batch processing system on the local machine
3. We want to send a job off to a far away machine, in order to hand that job to another batch processing system on that machine
18 The Grid Universe
All handled in the submit description file
Supports several back end types:
    Globus: GT2, GT3, GT4
    NorduGrid
    UNICORE
    Condor
    PBS
    LSF
19 Condor-G
Condor-G describes jobs to be handed off to a machine that runs Globus middleware
    gt2: Globus Toolkit 1 or 2, or the pre-web services GRAM
    gt3: Globus Toolkit 3
    gt4: Globus Toolkit 4, or WS GRAM
20 Submit Description File
For gt2:
    universe = grid
    input = job1.input
    output = job1.result
    log = job1.log
    grid_resource = gt2
    queue
The job manager named in grid_resource is one of: jobmanager, jobmanager-condor, jobmanager-pbs, jobmanager-lsf, jobmanager-sge
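As a sketch of how the gt2 resource string might be assembled, the helper below joins a host and one of the job manager names listed on this slide. The `host/jobmanager` layout is an assumption here, since the slide does not show the full value, and both the function name and `gatekeeper.example.org` are hypothetical.

```python
# Job manager names listed on the slide.
JOBMANAGERS = {
    "jobmanager",
    "jobmanager-condor",
    "jobmanager-pbs",
    "jobmanager-lsf",
    "jobmanager-sge",
}


def gt2_grid_resource(host: str, jobmanager: str = "jobmanager") -> str:
    """Build a grid_resource value for gt2 (assumed layout: host/jobmanager)."""
    if jobmanager not in JOBMANAGERS:
        raise ValueError(f"unknown job manager: {jobmanager}")
    return f"gt2 {host}/{jobmanager}"


# Hypothetical gatekeeper host, used only for illustration.
resource = gt2_grid_resource("gatekeeper.example.org", "jobmanager-pbs")
```

Rejecting unknown job manager names early is a design choice of this sketch; it catches typos before the job is submitted.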
21 Submit Description File
For gt3:
    universe = grid
    input = job2.input
    output = job2.result
    log = job2.log
    grid_resource = gt3 /gram/XXXManagedJobFactoryService
    queue
The resource address is given as IP address:Port number
XXX is one of: Fork, Condor, PBS, LSF, SGE
22 Submit Description File
For gt4:
    universe = grid
    input = job3.input
    output = job3.result
    log = job3.log
    grid_resource = gt4 service/ManagedJobFactoryService XXX
    queue
The resource address is given as IP address:Port number OR Host name:Port number
XXX is one of: Fork, Condor, PBS, LSF, SGE
23 Nordugrid and the Submit Description File
    universe = grid
    input = job4.input
    output = job4.result
    log = job4.log
    grid_resource = nordugrid
    queue
24 Unicore and the Submit Description File
    universe = grid
    input = job5.input
    output = job5.result
    log = job5.log
    grid_resource = unicore vsite
    keystore_file = /frieda/certificates/keystore
    keystore_alias = "frieda"
    keystore_passphrase_file = /frieda/private/passphrase
    queue
vsite is the name of the Unicore virtual resource
25 PBS and the Submit Description File
Details of the PBS installation in $(GLITE_LOCATION)/etc/batch_gahp.config
    universe = grid
    input = job6.input
    output = job6.result
    log = job6.log
    grid_resource = pbs
    queue
26 LSF and the Submit Description File
Details of the LSF installation in $(GLITE_LOCATION)/etc/batch_gahp.config
    universe = grid
    input = job7.input
    output = job7.result
    log = job7.log
    grid_resource = lsf
    queue
27 Condor-C Condor is running here, and Condor is running over there For the case where We want to send a job off to a far away machine, in order to hand that job to another batch processing system on that machine
28 Condor-C and the Submit Description File
    universe = grid
    input = job8.input
    output = job8.result
    log = job8.log
    grid_resource = condor
    +remote_jobuniverse = 5
    +remote_requirements = True
    +remote_ShouldTransferFiles = "YES"
    +remote_WhenToTransferOutput = "ON_EXIT"
    queue
The grid_resource value gives the schedd name and the collector machine name
A remote_jobuniverse of 5 is the vanilla universe
29 Credentials
Not just anybody can use any resource at any time...
Key concepts:
    Authentication: verification of an identity
    Authorization: permission to do something
30 Authentication If Frieda says “I am Frieda.”, how do we distinguish this from Frieda saying “I am George Bush.”?
31 Authentication Bush can do whatever he pleases. If Frieda claims to be Bush (and this is accepted), then Frieda can do whatever she pleases. Authentication attempts to verify the identity of the entity that is communicating
32 Authorization
Who is allowed (permitted) to do what
    Frieda may run gt4 jobs on the Open Science Grid machines
    Fred may write to files in /usr/bin
    The Unix user root may do anything!
Can be implemented with a list of those authorized
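The "list of those authorized" idea can be sketched with a set of (user, action) pairs. The names and action strings below simply restate the slide's examples; the function and variable names are hypothetical and this is not how Condor stores its authorization lists.

```python
from typing import Set, Tuple

# Each entry says: this user may perform this action.
AUTHORIZED: Set[Tuple[str, str]] = {
    ("frieda", "run gt4 jobs on Open Science Grid"),
    ("fred", "write to /usr/bin"),
}

# Users who may do anything, like the Unix user root.
SUPERUSERS = {"root"}


def is_authorized(user: str, action: str) -> bool:
    """Return True if user appears on the list for action, or is a superuser."""
    return user in SUPERUSERS or (user, action) in AUTHORIZED
```

Note that this check only makes sense after authentication: it trusts that `user` really is who they claim to be.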
33 Condor and Authentication
Authentication within Condor comes in many forms. Here are three.
1. File system: Have the entity write a file. The OS attaches a name to the file owner. Condor checks that the entity’s claim is the same as the file owner.
2. GSI (Grid Security Infrastructure)
3. Kerberos
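The file system method in item 1 can be sketched on a Unix machine as follows. This is an illustration of the idea only, not Condor's implementation: the function names are hypothetical, numeric user IDs stand in for names, and the "entity" is simulated by a callback running in the same process.

```python
import os
import tempfile
from typing import Callable


def fs_authenticate(claimed_uid: int,
                    write_challenge_file: Callable[[str], None]) -> bool:
    """Ask the entity to write a file, then compare its owner to the claim."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "challenge")
        # The entity proves itself by creating the file at the agreed path.
        write_challenge_file(path)
        try:
            # The OS, not the entity, records who owns the file.
            owner_uid = os.stat(path).st_uid
        except FileNotFoundError:
            return False
        return owner_uid == claimed_uid


def write_as_current_user(path: str) -> None:
    """Stand-in for the entity: creates the file as the current user."""
    with open(path, "w") as handle:
        handle.write("challenge\n")


accepted = fs_authenticate(os.getuid(), write_as_current_user)
```

The check works because an entity cannot make the operating system record someone else as the owner of a file it creates, which also means this method only applies when both sides share a file system.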
34 Authentication Idea A centralized certificate authority (CA) verifies an entity’s identity. When satisfied, the CA issues a signed certificate (also called a credential).
35 Authentication To authenticate, the entity presents the certificate. All is well, if we trust the CA and the remote machine
36 GSI Authentication GSI uses X.509 certificates. Grid universe jobs submitted to back end types that use Globus middleware (gt2, gt3, gt4), as well as nordugrid and unicore, use X.509 certificates. Condor can also use GSI
37 Revocation, Trust, and Proxies The CA may revoke a credential. Frieda gives the signed credential to the remote machine. If the remote machine is malicious, it could impersonate Frieda. Therefore, a password protects the credential. A proxy is a credential derived from the original that can be used without the password, but is only valid for a specific (short) time period. MyProxy software enables GSI proxy management
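The short validity period is what limits the damage if a proxy falls into the wrong hands. A minimal sketch of that time check follows; the `Proxy` class and its fields are hypothetical and carry none of the cryptographic content of a real X.509 proxy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Proxy:
    """A stand-in for a proxy credential: who it speaks for, and for how long."""
    subject: str
    issued_at: datetime
    lifetime: timedelta

    def is_valid(self, now: datetime) -> bool:
        # Valid from the moment of issue until the lifetime runs out.
        return self.issued_at <= now < self.issued_at + self.lifetime


# Hypothetical 12 hour proxy for Frieda.
issued = datetime(2006, 1, 1, 9, 0, tzinfo=timezone.utc)
proxy = Proxy(subject="frieda", issued_at=issued, lifetime=timedelta(hours=12))
```

A remote machine holding this proxy can act for Frieda during those 12 hours and no longer, whereas a stolen long-lived credential would remain usable until the CA revoked it.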