Job Delegation and Planning in Condor-G
Todd Tannenbaum
Computer Sciences Department, University of Wisconsin-Madison
ISGC 2005, Taipei, Taiwan
3 The Condor Project (Established '85) Distributed High Throughput Computing research performed by a team of ~35 faculty, full-time staff, and students who: face software engineering challenges in a distributed UNIX/Linux/NT environment; are involved in national and international grid collaborations; actively interact with academic and commercial users; maintain and support large distributed production environments; and educate and train students. Funding: US Govt. (DoD, DoE, NASA, NSF, NIH), AT&T, IBM, INTEL, Microsoft, UW-Madison, …
4 A Multifaceted Project › Harnessing the power of clusters – dedicated and/or opportunistic (Condor) › Job management services for Grid applications (Condor-G, Stork) › Fabric management services for Grid resources (Condor, GlideIns, NeST) › Distributed I/O technology (Parrot, Kangaroo, NeST) › Job-flow management (DAGMan, Condor, Hawk) › Distributed monitoring and management (HawkEye) › Technology for Distributed Systems (ClassAd, MW) › Packaging and Integration (NMI, VDT)
5 Some software produced by the Condor Project › Condor System › ClassAd Library › DAGMan › Fault Tolerant Shell (FTSH) › Hawkeye › GCB › MW › NeST › Stork › Parrot › VDT › And others… all as open source
6 Who uses Condor? › Commercial: Oracle, Micron, Hartford Life Insurance, CORE, Xerox, ExxonMobil, Shell, Alterra, Texas Instruments, … › Research Community: Universities, Govt Labs; Bundles: NMI, VDT; Grid Communities: EGEE/LCG/gLite, Particle Physics Data Grid (PPDG), USCMS, LIGO, iVDGL, NSF Middleware Initiative GRIDS Center, …
7 Condor Pool (diagram: submit-side Schedds holding Jobs, a central MatchMaker, and execute-side Startds)
9 Condor-G (diagram: a Schedd holding Jobs delegates through Globus 2, Globus 4, Unicore, or NorduGrid middleware to remote batch systems such as LSF and PBS, to a remote Startd, or to another Schedd via Condor-G / Condor-C)
10 (diagram: layered view; the User/Application/Portal sits on Condor-G, which sits on the Grid middleware (Globus 2, Globus 4, Unicore, …) and a Condor Pool, all on top of the fabric of processing, storage, and communication resources)
11 Job Delegation
› Transfer of responsibility to schedule and execute a job:
Stage in executable and data files
Transfer policy "instructions"
Securely transfer (and refresh?) credentials, obtain local identities
Monitor and present job progress (transparency!)
Return results
› Multiple delegations can be combined in interesting ways
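In Condor-G these pieces map directly onto the submit description. A minimal sketch, assuming a gt2 gatekeeper like the one used later in this talk; the input file name and proxy path are illustrative, while transfer_input_files, x509userproxy, and log are standard submit commands:

universe             = grid
grid_type            = gt2
globusscheduler      = cluster1.cs.wisc.edu/jobmanager-lsf
# executable and input data are staged in to the remote site
executable           = find_particle
transfer_input_files = particle_data.in
# credential to delegate (and refresh); the path is illustrative
x509userproxy        = /tmp/x509up_u1234
# results are returned on completion; the local event log provides transparency
output               = find_particle.out
error                = find_particle.err
log                  = find_particle.log
queue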
12 Simple Job Delegation in Condor-G (diagram: Condor-G delegates to Globus GRAM on the cluster front-end, which hands the job to the local Batch System and on to an Execute Machine)
13 Expanding the Model › What can we do with new forms of job delegation? › Some ideas: Mirroring, Load-balancing, Glide-in schedd/startd, Multi-hop grid scheduling
14 Mirroring › What it does Jobs mirrored on two Condor-Gs If primary Condor-G crashes, secondary one starts running jobs On recovery, primary Condor-G gets job status from secondary one › Removes Condor-G submit point as single point of failure
15 Mirroring Example (diagram: Jobs submitted to Condor-G 1 are mirrored on Condor-G 2; either can drive the Execute Machine)
17 Load-Balancing › What it does Front-end Condor-G distributes all jobs among several back-end Condor-Gs Front-end Condor-G keeps updated job status › Improves scalability › Maintains single submit point for users
18 Load-Balancing Example (diagram: a Condor-G Front-end distributes jobs among Condor-G Back-ends 1, 2, and 3)
19 Glide-In
› Schedd and Startd are separate services that do not require any special privileges
Thus we can submit them as jobs!
› Glide-In Schedd
What it does: drop a Condor-G onto the front-end machine of a remote cluster, then delegate jobs to the cluster through the glide-in schedd
Can apply cluster-specific policies to jobs
Not fork-and-forget… send a manager to the site, instead of managing across the internet
20 Glide-In Schedd Example (diagram: Condor-G delegates Jobs through the grid Middleware to a Glide-In Schedd on the cluster Frontend, which feeds the local Batch System)
21 Glide-In Startd Example (diagram: Condor-G (Schedd) submits a Startd through the Middleware and Frontend into the Batch System; the Job then runs under that Startd)
22 Glide-In Startd › Why? Restores all the benefits that may have been washed away by the middleware End-to-end management solution Preserves job semantic guarantees Preserves policy Enables lazy planning
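Because the startd needs no special privileges, the glide-in itself can be submitted as an ordinary grid job. A minimal sketch under that assumption; glidein_startup.sh and glidein_condor.tar.gz are hypothetical stand-ins for whatever wrapper unpacks the Condor daemons and starts condor_master/condor_startd, and condor.cs.wisc.edu is an illustrative home-pool central manager:

universe             = grid
grid_type            = gt2
globusscheduler      = cluster1.cs.wisc.edu/jobmanager-lsf
# hypothetical wrapper: unpack the tarball, start condor_master/condor_startd
# configured to report back to the home pool named in the arguments
executable           = glidein_startup.sh
arguments            = condor.cs.wisc.edu
transfer_input_files = glidein_condor.tar.gz
output               = glidein.out
error                = glidein.err
log                  = glidein.log
queue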
23 Sample Job Submit file
universe = grid
grid_type = gt2
globusscheduler = cluster1.cs.wisc.edu/jobmanager-lsf
executable = find_particle
arguments = ….
output = ….
log = …
But we want metascheduling…
24 Represent grid clusters as ClassAds
› ClassAds are a set of uniquely named expressions; each expression, called an attribute, is an attribute name/value pair
Combine query and data
Extensible
Semi-structured: no fixed schema (flexibility in an environment consisting of distributed administrative domains)
› Designed for "MatchMaking"
25 Example of a ClassAd that could represent a compute cluster in a grid:
Type = "GridSite";
Name = "FermiComputeCluster";
Arch = "Intel-Linux";
Gatekeeper_url = "globus.fnal.gov/lsf";
Load = [ QueuedJobs = 42; RunningJobs = 200; ];
Requirements = ( other.Type == "Job" && Load.QueuedJobs < 100 );
GoodPeople = { "howard", "harry" };
Rank = member(other.Owner, GoodPeople) * 500
26 Another Sample - Job Submit
universe = grid
grid_type = gt2
owner = howard
executable = find_particle.$$(Arch)
requirements = other.Arch == "Intel-Linux" || other.Arch == "Sparc-Solaris"
rank = 0 - other.Load.QueuedJobs
globusscheduler = $$(gatekeeper_url)
…
Note: We introduced augmentation of the job ClassAd based upon information discovered in its matching resource ClassAd.
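For concreteness: if this job matched the FermiComputeCluster ad on slide 25, the $$() references would be rewritten from the matched resource ad, so the delegated job would effectively carry (a sketch of the rewritten lines):

executable = find_particle.Intel-Linux
globusscheduler = globus.fnal.gov/lsf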
27 Multi-Hop Grid Scheduling › Match a job to a Virtual Organization (VO), then to a resource within that VO › Easier to schedule jobs across multiple VOs and grids
28 Multi-Hop Grid Scheduling Example (diagram: an experiment's Condor-G and Resource Broker, e.g. for an HEP experiment such as CMS, delegate to a VO-level Condor-G and Resource Broker, which delegate in turn through Globus GRAM to a site Batch Scheduler)
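A hedged sketch of what the first hop's target might look like as a ClassAd, in the spirit of the GridSite ad on slide 25; the attribute names used here (Broker_url, VOMemberships) are illustrative assumptions, not an established schema:

[
  Type         = "VirtualOrganization";
  Name         = "CMS";
  Broker_url   = "vo-broker.example.org";
  Requirements = ( other.Type == "Job" && member("CMS", other.VOMemberships) )
]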
29 Endless Possibilities › These new models can be combined with each other or with other new models › Resulting system can be arbitrarily sophisticated
30 Job Delegation Challenges › New complexity introduces new issues and exacerbates existing ones › A few: Transparency, Representation, Scheduling Control, Active Job Control, Revocation, Error Handling and Debugging
31 Transparency › Full information about job should be available to user Information from full delegation path No manual tracing across multiple machines › Users need to know what’s happening with their jobs
32 Representation › Job state is a vector › How best to show this to the user? Summary: current delegation endpoint and job state at that endpoint, with full information available if desired. A series of nested ClassAds?
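One possible rendering of that vector is a chain of nested ClassAds, one per delegation hop. A minimal sketch; the attribute names (Endpoint, JobState, Delegation) and host names are invented for illustration:

[
  Endpoint   = "condorg.cs.wisc.edu";
  JobState   = "Delegated";
  Delegation =
    [
      Endpoint   = "frontend.fnal.gov";
      JobState   = "Delegated";
      Delegation =
        [
          Endpoint = "lsf-node-17";
          JobState = "Running"
        ]
    ]
]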
33 Scheduling Control › Avoid loops in delegation path › Give user control of scheduling Allow limiting of delegation path length? Allow user to specify part or all of delegation path
34 Active Job Control › User may request certain actions hold, suspend, vacate, checkpoint › Actions cannot be completed synchronously for user Must forward along delegation path User checks completion later
35 Active Job Control (cont) › Endpoint systems may not support actions If possible, execute them at furthest point that does support them › Allow user to apply action in middle of delegation path
36 Revocation › Leases Lease must be renewed periodically for delegation to remain valid Allows revocation during long-term failures › What are good values for lease lifetime and update interval?
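Condor's submit language already has a lease knob in this spirit, covering the claim between the submitting schedd and the startd; a minimal sketch (the 20-minute value is purely illustrative, and choosing it well is exactly the open question above):

universe           = vanilla
executable         = find_particle
log                = find_particle.log
# if the submit side cannot renew the lease within this many seconds
# (e.g. during a long outage), the execute side may abandon the job
job_lease_duration = 1200
queue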
37 Error Handling and Debugging › Many more places for things to go horribly wrong › Need clear, simple error semantics › Logs, logs, logs Have them everywhere
38 From earlier
› Transfer of responsibility to schedule and execute a job:
Transfer policy "instructions"
Stage in executable and data files
Securely transfer (and refresh?) credentials, obtain local identities
Monitor and present job progress (transparency!)
Return results
39 Job Failure Policy Expressions
› Condor/Condor-G augmented so users can supply job failure policy expressions in the submit file.
› Can be used to describe a successful run, or what to do in the face of failure.
on_exit_remove = <expression>
on_exit_hold = <expression>
periodic_remove = <expression>
periodic_hold = <expression>
40 Job Failure Policy Examples
› Do not remove from queue (i.e. reschedule) if the job exits with a signal:
on_exit_remove = ExitBySignal == False
› Place on hold if the job exits with nonzero status or ran for less than an hour:
on_exit_hold = ((ExitBySignal == False) && (ExitCode != 0)) || ((ServerStartTime - JobStartDate) < 3600)
› Place on hold if the job has spent more than 50% of its time suspended:
periodic_hold = CumulativeSuspensionTime > (RemoteWallClockTime / 2.0)
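Put together in a submit file, these expressions might look like the following sketch (the executable and gatekeeper follow the earlier sample submit file; the specific thresholds are illustrative):

universe        = grid
grid_type       = gt2
globusscheduler = cluster1.cs.wisc.edu/jobmanager-lsf
executable      = find_particle
log             = find_particle.log
# reschedule rather than remove if the job was killed by a signal
on_exit_remove  = ExitBySignal == False
# hold if the job failed, or finished suspiciously fast (under an hour)
on_exit_hold    = ((ExitBySignal == False) && (ExitCode != 0)) || ((ServerStartTime - JobStartDate) < 3600)
# hold if the job has spent more than half its time suspended
periodic_hold   = CumulativeSuspensionTime > (RemoteWallClockTime / 2.0)
queue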
41 Data Placement* (DaP) must be an integral part of the end-to-end solution
*space management and data transfer
42 Stork › A scheduler for data placement activities in the Grid › What Condor is for computational jobs, Stork is for data placement › Stork comes with a new concept: “Make data placement a first class citizen in the Grid.”
43 (diagram: an end-to-end run split into Computational Jobs (stage-in, execute the job, stage-out) and Data Placement Jobs (allocate space for input & output data, stage-in, release input space, stage-out, release output space))
44 DAGMan: DAG with DaP (diagram: a single DAG specification drives both the Condor Job Queue for computational nodes and the Stork Job Queue for data placement nodes; DAG specification excerpt: DaP A A.submit; DaP B B.submit; Job C C.submit; …; Parent A child B; Parent B child C; Parent C child D, E; …)
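Spelled out as a DAG file in the style of the specification above (node and submit-file names are the slide's own; the child lists are written in standard DAGMan form, with children separated by spaces):

# data placement nodes, executed by Stork
DaP A A.submit
DaP B B.submit
# computational node, executed by Condor
Job C C.submit
Parent A Child B
Parent B Child C
# … remaining nodes (D, E, F) and edges elided as on the slide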
45 Why Stork? › Stork understands the characteristics and semantics of data placement jobs. › Can make smart scheduling decisions, for reliable and efficient data placement.
46 Failure Recovery and Efficient Resource Utilization › Fault tolerance: just submit a bunch of data placement jobs, and then go away… › Control the number of concurrent transfers from/to any storage system (prevents overloading) › Space allocation and de-allocation: make sure space is available
47 Support for Heterogeneity Protocol translation using Stork memory buffer.
48 Support for Heterogeneity Protocol translation using Stork Disk Cache.
49 Flexible Job Representation and Multilevel Policy Support
[
  Type = "Transfer";
  Src_Url = "srb://ghidorac.sdsc.edu/kosart.condor/x.dat";
  Dest_Url = "nest://turkey.cs.wisc.edu/kosart/x.dat";
  ……
  Max_Retry = 10;
  Restart_in = "2 hours";
]
50 Run-time Adaptation › Dynamic protocol selection
[
  dap_type = "transfer";
  src_url = "drouter://slic04.sdsc.edu/tmp/test.dat";
  dest_url = "drouter://quest2.ncsa.uiuc.edu/tmp/test.dat";
  alt_protocols = "nest-nest, gsiftp-gsiftp";
]
[
  dap_type = "transfer";
  src_url = "any://slic04.sdsc.edu/tmp/test.dat";
  dest_url = "any://quest2.ncsa.uiuc.edu/tmp/test.dat";
]
51 Run-time Adaptation › Run-time Protocol Auto-tuning
[
  link = "slic04.sdsc.edu - quest2.ncsa.uiuc.edu";
  protocol = "gsiftp";
  bs = 1024KB;      // block size
  tcp_bs = 1024KB;  // TCP buffer size
  p = 4;            // number of parallel streams
]
52 (diagram: the big picture; a Planner and DAGMan sit above Condor-G and Stork, which talk to services such as GRAM, RFT, SRM, SRB, NeST, and GridFTP; StartD and Parrot serve the Application at the execution site)
53 Thank You! › Questions?