Jaime Frey, Computer Sciences Department, University of Wisconsin-Madison. Condor-G: A Case in Distributed Job Delegation


1 Condor-G: A Case in Distributed Job Delegation
Jaime Frey
Computer Sciences Department, University of Wisconsin-Madison
jfrey@cs.wisc.edu
http://www.cs.wisc.edu/condor

2 Job Delegation
› Transfer of responsibility to schedule and execute a job
› Multiple delegations can form a chain

3 Job Delegation in Condor-G Today
[Diagram: Condor-G, Globus GRAM, Batch System Front-end, Execute Machine]

4 Expanding the Model
› What can we do with new forms of job delegation?
› Some ideas
  - Mirroring
  - Load-balancing
  - Glide-in schedd
  - Multi-hop grid scheduling

5 Mirroring
› What it does
  - Jobs mirrored on two Condor-Gs
  - If primary Condor-G crashes, secondary one starts running jobs
  - On recovery, primary Condor-G gets job status from secondary one
› Removes Condor-G submit point as single point of failure
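The mirroring behaviour described on this slide can be sketched roughly as follows. This is only an illustration, not Condor-G code: the `CondorG` class, the `active` flag, and the `submit`, `fail_over`, and `recover` helpers are all hypothetical names. The slide says the primary gets job status from the secondary on recovery; that the primary then resumes control is an assumption made here.

```python
class CondorG:
    """Hypothetical stand-in for one Condor-G submit point."""

    def __init__(self, name, active):
        self.name = name
        self.active = active  # True if this instance is currently managing jobs
        self.jobs = {}        # job_id -> status string


def submit(primary, secondary, job_id):
    # Every job is recorded on both instances (the mirror).
    primary.jobs[job_id] = "IDLE"
    secondary.jobs[job_id] = "IDLE"


def fail_over(primary, secondary):
    # Primary has crashed: the secondary starts running the jobs.
    primary.active = False
    secondary.active = True


def recover(primary, secondary):
    # Primary is back: copy job status from the secondary.
    primary.jobs = dict(secondary.jobs)
    # Assumption (not stated on the slide): the primary then resumes control.
    secondary.active = False
    primary.active = True
```

Because both instances hold a record of every job, either one can report status, which is what removes the submit point as a single point of failure.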

6 Mirroring Example
[Diagram: Condor-G 1, Matchmaker, Execute Machine, Condor-G 2]

7 Mirroring Example
[Diagram: Condor-G 1, Matchmaker, Execute Machine, Condor-G 2]

8 Load-Balancing
› What it does
  - Front-end Condor-G distributes all jobs among several back-end Condor-Gs
  - Front-end Condor-G keeps updated job status
› Improves scalability
› Maintains single submit point for users
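One way to picture the front-end/back-end split is the sketch below. It is a minimal illustration under stated assumptions: the `BackEnd` and `FrontEnd` classes are hypothetical, and the least-loaded placement policy is an assumption, since the slide does not say how jobs are distributed.

```python
from dataclasses import dataclass, field


@dataclass
class BackEnd:
    """Hypothetical back-end Condor-G holding its own job queue."""
    name: str
    jobs: dict = field(default_factory=dict)  # job_id -> status string

    def submit(self, job_id):
        self.jobs[job_id] = "IDLE"


class FrontEnd:
    """Hypothetical front-end: the single submit point users talk to."""

    def __init__(self, back_ends):
        self.back_ends = back_ends
        self.placement = {}  # job_id -> BackEnd that holds the job

    def submit(self, job_id):
        # Assumed policy: send the job to the back-end with the fewest jobs.
        target = min(self.back_ends, key=lambda b: len(b.jobs))
        target.submit(job_id)
        self.placement[job_id] = target
        return target.name

    def status(self, job_id):
        # The front-end answers status queries by asking the holding back-end.
        return self.placement[job_id].jobs[job_id]
```

Users only ever call `FrontEnd.submit` and `FrontEnd.status`, so the number of back-ends can grow without changing the submit point.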

9 Load-Balancing Example
[Diagram: Condor-G Front-end; Condor-G Back-end 1, Condor-G Back-end 2, Condor-G Back-end 3]

10 Glide-In Schedd
› What it does
  - Drop a Condor-G onto the front-end machine of a cluster
  - Delegate jobs to the cluster through the glide-in schedd
› Apply cluster-specific policies to jobs

11 Glide-In Schedd Example
[Diagram: Condor-G, Glide-In Schedd, Batch System]

12 Multi-Hop Grid Scheduling
› Match a job to a Virtual Organization (VO), then to a resource within that VO
› Easier to schedule jobs across multiple VOs and grids
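The two-step match (first a VO, then a resource inside it) can be sketched as below. The dictionary layout, the `members` and `memory_mb` fields, and the function names are all assumptions chosen for illustration; real matchmaking would use ClassAd expressions rather than these hard-coded checks.

```python
def vo_accepts(job, vo):
    # First hop: does this VO accept jobs from the job's owner? (assumed rule)
    return job["owner"] in vo["members"]


def match_resource(job, vo):
    # Second hop: first resource in the VO with enough memory (assumed rule).
    for resource in vo["resources"]:
        if resource["memory_mb"] >= job["memory_mb"]:
            return resource
    return None


def multi_hop_match(job, vos):
    """Return (vo_name, resource_name) for the first workable match, else None."""
    for vo in vos:
        if not vo_accepts(job, vo):
            continue
        resource = match_resource(job, vo)
        if resource is not None:
            return vo["name"], resource["name"]
    return None
```

Splitting the decision this way means the experiment-level broker only needs to know about VOs, while each VO-level broker only needs to know its own resources.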

13 Multi-Hop Grid Scheduling Example
[Diagram: Experiment Condor-G, Experiment Resource Broker, VO Condor-G, VO Resource Broker, Globus GRAM, Batch Scheduler]

14 Endless Possibilities
› These new models can be combined with each other or with other new models
› Resulting system can be arbitrarily sophisticated

15 Job Delegation Challenges
› New complexity introduces new issues and exacerbates existing ones
› A few…
  - Transparency
  - Representation
  - Scheduling Control
  - Active Job Control
  - Revocation
  - Error Handling and Debugging

16 Transparency
› Full information about job should be available to user
  - Information from full delegation path
  - No manual tracing across multiple machines
› Users need to know what's happening with their jobs

17 Representation
› Job state is a vector
› How best to show this to user
  - Summary
    - Current delegation endpoint
    - Job state at endpoint
  - Full information available if desired
    - Series of nested ClassAds?
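The idea of job state as a vector, with a short summary on top and the full nested record underneath, might look like the sketch below. Nested Python dicts stand in for nested ClassAds here, and the attribute names `Where`, `State`, and `Delegated` are hypothetical, not real ClassAd attributes.

```python
def delegation_vector(ad):
    """Flatten a nested job record into one (location, state) pair per hop."""
    hops = []
    while ad is not None:
        hops.append((ad["Where"], ad["State"]))
        ad = ad.get("Delegated")  # next hop's record, or None at the endpoint
    return hops


def summary(ad):
    """One-line view: the current delegation endpoint and the state there."""
    where, state = delegation_vector(ad)[-1]
    return f"{state} at {where}"


# Hypothetical three-hop job record.
example_ad = {
    "Where": "condor-g", "State": "DELEGATED",
    "Delegated": {
        "Where": "gram", "State": "DELEGATED",
        "Delegated": {"Where": "batch", "State": "RUNNING"},
    },
}
```

For `example_ad`, `summary` gives `"RUNNING at batch"`, while `delegation_vector` exposes every hop for a user who wants the full picture.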

18 Scheduling Control
› Avoid loops in delegation path
› Give user control of scheduling
  - Allow limiting of delegation path length?
  - Allow user to specify part or all of delegation path
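Loop avoidance and a path-length limit can both be checked from the list of schedulers a job has already visited. The sketch below assumes the job carries that list with it; `can_delegate` and `max_hops` are hypothetical names, not Condor-G settings.

```python
def can_delegate(path, next_hop, max_hops=None):
    """Decide whether a job may be delegated to next_hop.

    path: schedulers that already hold the job, starting with the submit point.
    max_hops: optional user-supplied cap on the number of delegations.
    """
    # A scheduler already on the path would create a loop.
    if next_hop in path:
        return False
    # len(path) - 1 delegations have happened so far; one more makes len(path).
    if max_hops is not None and len(path) > max_hops:
        return False
    return True
```

A user-specified path could be enforced with the same record, by comparing `next_hop` against the next entry the user listed.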

19 Active Job Control
› User may request certain actions
  - hold, suspend, vacate, checkpoint
› Actions cannot be completed synchronously for user
  - Must forward along delegation path
  - User checks completion later

20 Active Job Control (cont)
› Endpoint systems may not support actions
  - If possible, execute them at furthest point that does support them
› Allow user to apply action in middle of delegation path
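Choosing where an action runs can be sketched as a walk backwards from the endpoint, stopping at the first hop that supports the action. The `ActionRequest` record and `request_action` function are hypothetical; the `done` flag models the point that the user gets a pending request back and checks completion later.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionRequest:
    """Hypothetical record handed back to the user immediately."""
    action: str
    target: Optional[str]  # hop chosen to carry out the action, or None
    done: bool = False     # flipped later, once the target reports completion


def request_action(path, action):
    """path: list of (hop_name, supported_actions), submit point first."""
    target = None
    # Walk from the endpoint back toward the submit point.
    for name, supported in reversed(path):
        if action in supported:
            target = name
            break
    return ActionRequest(action, target)
```

Letting the user apply an action in the middle of the path would amount to passing an explicit target instead of searching for one.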

21 Revocation
› Leases
  - Lease must be renewed periodically for delegation to remain valid
  - Allows revocation during long-term failures
› What are good values for lease lifetime and update interval?
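A lease of this kind reduces to an expiry time that the delegating side keeps pushing forward. The sketch below is illustrative only: the `Lease` class is hypothetical, and times are passed in explicitly (in seconds) so the behaviour is easy to follow.

```python
class Lease:
    """Hypothetical lease on one delegation."""

    def __init__(self, lifetime, now):
        self.lifetime = lifetime
        self.expires = now + lifetime

    def valid(self, now):
        # The delegation remains valid only until the lease expires.
        return now < self.expires

    def renew(self, now):
        # An expired lease cannot be revived: the delegation is revoked.
        if not self.valid(now):
            return False
        self.expires = now + self.lifetime
        return True
```

If the delegator becomes unreachable for longer than `lifetime`, renewals stop arriving and the delegate can treat the job as revoked without any explicit message. The renewal interval has to be shorter than the lifetime to ride out brief outages, which is the trade-off the slide's open question points at.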

22 Error Handling and Debugging
› Many more places for things to go horribly wrong
› Need clear, simple error semantics
› Logs, logs, logs
  - Have them everywhere

23 Current Status
› Done
  - Mirroring
› In Progress
  - Condor-G -> Condor-G delegation
    - User must specify hops
  - Glide-in schedd
    - Set up by hand

24 Thank You!
› Questions?

