Condor-G: A Case in Distributed Job Delegation
Jaime Frey
Computer Sciences Department
University of Wisconsin-Madison
Job Delegation
› Transfer of responsibility to schedule and execute a job
› Multiple delegations can form a chain
Job Delegation in Condor-G Today
(Diagram: Condor-G, Globus GRAM, Batch System Front-end, Execute Machine)
Expanding the Model
› What can we do with new forms of job delegation?
› Some ideas
    › Mirroring
    › Load-balancing
    › Glide-in schedd
    › Multi-hop grid scheduling
Mirroring
› What it does
    › Jobs mirrored on two Condor-Gs
    › If primary Condor-G crashes, secondary one starts running jobs
    › On recovery, primary Condor-G gets job status from secondary one
› Removes Condor-G submit point as single point of failure
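The failover behaviour described on this slide can be sketched roughly as follows. This is a toy in-memory model, not Condor-G code: `SubmitPoint`, `MirrorPair`, and every method name are hypothetical and exist only for illustration. How real mirrors detect a crash or exchange status is not covered in the talk, so the sketch simply flips an `alive` flag.

```python
from dataclasses import dataclass, field


@dataclass
class SubmitPoint:
    """Hypothetical stand-in for one Condor-G submit point."""
    name: str
    alive: bool = True
    job_status: dict = field(default_factory=dict)  # job id -> status string


class MirrorPair:
    """Toy model of two submit points that mirror the same job queue."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def submit(self, job_id):
        # Every job is recorded on both submit points.
        for point in (self.primary, self.secondary):
            point.job_status[job_id] = "idle"

    def active(self):
        # The secondary only manages jobs while the primary is down.
        return self.primary if self.primary.alive else self.secondary

    def run(self, job_id):
        # Whichever submit point is active records the state change.
        self.active().job_status[job_id] = "running"

    def recover_primary(self):
        # On recovery the primary copies job status from the secondary.
        self.primary.alive = True
        self.primary.job_status.update(self.secondary.job_status)
```

In this sketch a job started while the primary is down is tracked only by the secondary until `recover_primary()` copies the status back.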
Mirroring Example
(Diagram: Condor-G 1, Matchmaker, Execute Machine, Condor-G 2)
Load-Balancing
› What it does
    › Front-end Condor-G distributes all jobs among several back-end Condor-Gs
    › Front-end Condor-G keeps updated job status
› Improves scalability
› Maintains single submit point for users
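A minimal sketch of the front-end/back-end split might look like the following. The `FrontEnd` and `BackEnd` classes are hypothetical, and the "fewest jobs" placement policy is an assumption made for the example; the talk does not say how jobs are distributed. For brevity the front-end looks status up on demand rather than keeping its own updated copy.

```python
class BackEnd:
    """Hypothetical back-end Condor-G that holds status for its jobs."""

    def __init__(self, name):
        self.name = name
        self.jobs = {}  # job id -> status string

    def accept(self, job_id):
        self.jobs[job_id] = "idle"


class FrontEnd:
    """Toy front-end: one submit point that spreads jobs over back-ends."""

    def __init__(self, back_ends):
        self.back_ends = back_ends
        self.placement = {}  # job id -> back-end holding the job

    def submit(self, job_id):
        # Assumed policy: pick the back-end currently holding the fewest jobs.
        target = min(self.back_ends, key=lambda b: len(b.jobs))
        target.accept(job_id)
        self.placement[job_id] = target
        return target.name

    def status(self, job_id):
        # Users query only the front-end; it consults the owning back-end.
        return self.placement[job_id].jobs[job_id]
```

Users submit and query through `FrontEnd` alone, which is what preserves the single submit point.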
Load-Balancing Example
(Diagram: Condor-G Front-end; Condor-G Back-end 1, Condor-G Back-end 2, Condor-G Back-end 3)
Glide-In Schedd
› What it does
    › Drop a Condor-G onto the front-end machine of a cluster
    › Delegate jobs to the cluster through the glide-in schedd
› Apply cluster-specific policies to jobs
Glide-In Schedd Example
(Diagram: Condor-G, Glide-In Schedd, Batch System)
Multi-Hop Grid Scheduling
› Match a job to a Virtual Organization (VO), then to a resource within that VO
› Easier to schedule jobs across multiple VOs and grids
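The two-stage match (VO first, then a resource inside the VO) could be sketched as below. The dictionary layout, the `allowed_vos` and `slots` keys, and the free-slot test are all assumptions invented for this example; real matchmaking would evaluate ClassAd requirements rather than compare integers.

```python
def match_job(job, vos):
    """Two-hop match: choose a VO first, then a resource inside that VO.

    `job` and `vos` are plain dicts invented for this sketch:
    job = {"allowed_vos": set of VO names, "slots": slots needed}
    vos = {VO name: {resource name: free slots}}
    """
    for vo_name, resources in vos.items():
        # Hop 1: the job must be allowed to run in this VO.
        if vo_name not in job["allowed_vos"]:
            continue
        # Hop 2: find a resource in the VO with enough free slots.
        for resource, free_slots in resources.items():
            if free_slots >= job["slots"]:
                return vo_name, resource
    return None  # no VO could place the job
```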
Multi-Hop Grid Scheduling Example
(Diagram: Experiment Condor-G, Experiment Resource Broker, VO Condor-G, VO Resource Broker, Globus GRAM, Batch Scheduler)
Endless Possibilities
› These new models can be combined with each other or with other new models
› Resulting system can be arbitrarily sophisticated
Job Delegation Challenges
› New complexity introduces new issues and exacerbates existing ones
› A few…
    › Transparency
    › Representation
    › Scheduling Control
    › Active Job Control
    › Revocation
    › Error Handling and Debugging
Transparency
› Full information about job should be available to user
    › Information from full delegation path
    › No manual tracing across multiple machines
› Users need to know what’s happening with their jobs
Representation
› Job state is a vector
› How best to show this to user
    › Summary
        › Current delegation endpoint
        › Job state at endpoint
    › Full information available if desired
        › Series of nested ClassAds?
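One way to picture the "summary plus nested records" idea is with nested dictionaries standing in for nested ClassAds. This is only an illustration: the attribute names `Host`, `State`, and `Delegated` are hypothetical, and the dict literal is not ClassAd syntax.

```python
def summarize(job_ad):
    """Walk nested delegation records and report the endpoint and its state."""
    path = [job_ad["Host"]]
    current = job_ad
    while "Delegated" in current:  # follow the chain to the last hop
        current = current["Delegated"]
        path.append(current["Host"])
    return {"Endpoint": path[-1], "State": current["State"], "Path": path}


# Hypothetical three-hop job record; the full detail stays available
# in the nested structure, while summarize() gives the short view.
job_ad = {
    "Host": "hop-1",
    "State": "delegated",
    "Delegated": {
        "Host": "hop-2",
        "State": "delegated",
        "Delegated": {"Host": "hop-3", "State": "running"},
    },
}
```

Here `summarize(job_ad)` reports `hop-3` as the endpoint with state `running`, while the nested record keeps one state per hop, which is the "vector" the slide refers to.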
Scheduling Control
› Avoid loops in delegation path
› Give user control of scheduling
    › Allow limiting of delegation path length?
    › Allow user to specify part or all of delegation path
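Loop avoidance and a path-length limit can both be expressed as a check made before each delegation. The sketch below assumes the job carries the list of schedulers it has already passed through; `may_delegate` and `max_hops` are hypothetical names, and `max_hops` is taken here to mean the maximum number of schedulers allowed on the path.

```python
def may_delegate(path, next_hop, max_hops=None):
    """Return True if delegating to `next_hop` keeps the path loop-free
    and within the optional user-supplied length limit.

    `path` lists the schedulers that already hold or held the job.
    """
    if next_hop in path:
        return False  # would revisit a scheduler: a loop
    if max_hops is not None and len(path) + 1 > max_hops:
        return False  # would exceed the user's limit
    return True
```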
Active Job Control
› User may request certain actions
    › hold, suspend, vacate, checkpoint
› Actions cannot be completed synchronously for user
    › Must forward along delegation path
    › User checks completion later
Active Job Control (cont)
› Endpoint systems may not support actions
    › If possible, execute them at furthest point that does support them
› Allow user to apply action in middle of delegation path
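Choosing the "furthest point that does support" an action amounts to scanning the delegation path from the endpoint backwards. The sketch below assumes each hop advertises a set of supported actions; the hop records and the `supports` key are hypothetical, and the asynchronous forwarding and later completion check are left out.

```python
def furthest_supporting_hop(path, action):
    """Return the name of the furthest hop in `path` that supports `action`,
    or None if no hop does.

    `path` is ordered from the submit point to the endpoint.
    """
    for hop in reversed(path):
        if action in hop["supports"]:
            return hop["name"]
    return None


# Hypothetical path in which the endpoint cannot checkpoint.
path = [
    {"name": "hop-1", "supports": {"hold", "suspend", "vacate", "checkpoint"}},
    {"name": "hop-2", "supports": {"hold"}},
    {"name": "hop-3", "supports": {"hold", "suspend"}},
]
```

With this path, a `suspend` request would be executed at `hop-3`, while a `checkpoint` request would fall back to `hop-1`.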
Revocation
› Leases
    › Lease must be renewed periodically for delegation to remain valid
    › Allows revocation during long-term failures
› What are good values for lease lifetime and update interval?
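The lease mechanism can be illustrated with a small class that tracks an expiry time. This is a sketch under simple assumptions: time is passed in explicitly as a number of seconds, the `Lease` class is hypothetical, and a renewal that arrives after expiry is rejected. The choice of lifetime and renewal interval is exactly the open question raised on the slide, so the numbers in any use of this class are arbitrary.

```python
class Lease:
    """Toy lease: the delegation is valid only while the lease is unexpired."""

    def __init__(self, lifetime, now):
        self.lifetime = lifetime
        self.expires = now + lifetime

    def valid(self, now):
        return now < self.expires

    def renew(self, now):
        # A renewal arriving after expiry is too late: the delegation
        # has already been implicitly revoked.
        if not self.valid(now):
            return False
        self.expires = now + self.lifetime
        return True
```

If the delegating side stops renewing, for example because it has been unreachable for longer than the lifetime, the remote side can treat the delegation as revoked without any message being delivered.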
Error Handling and Debugging
› Many more places for things to go horribly wrong
› Need clear, simple error semantics
› Logs, logs, logs
    › Have them everywhere
Current Status
› Done
    › Mirroring
› In Progress
    › Condor-G -> Condor-G delegation
        › User must specify hops
    › Glide-in schedd
        › Set up by hand
Thank You!
› Questions?