Published by Egbert Jennings. Modified over 8 years ago.
1
Job Delegation and Planning in Condor-G
ISGC 2005, Taipei, Taiwan
Todd Tannenbaum
Computer Sciences Department, University of Wisconsin-Madison
tannenba@cs.wisc.edu
http://www.cs.wisc.edu/condor
3
The Condor Project (Established '85)
Distributed High Throughput Computing research performed by a team of ~35 faculty, full-time staff, and students who:
› face software engineering challenges in a distributed UNIX/Linux/NT environment,
› are involved in national and international grid collaborations,
› actively interact with academic and commercial users,
› maintain and support large distributed production environments, and
› educate and train students.
Funding: US Govt. (DoD, DoE, NASA, NSF, NIH), AT&T, IBM, INTEL, Microsoft, UW-Madison, …
4
A Multifaceted Project
› Harnessing the power of clusters, dedicated and/or opportunistic (Condor)
› Job management services for Grid applications (Condor-G, Stork)
› Fabric management services for Grid resources (Condor, GlideIns, NeST)
› Distributed I/O technology (Parrot, Kangaroo, NeST)
› Job-flow management (DAGMan, Condor, Hawk)
› Distributed monitoring and management (HawkEye)
› Technology for Distributed Systems (ClassAd, MW)
› Packaging and Integration (NMI, VDT)
5
Some software produced by the Condor Project
› Condor System
› ClassAd Library
› DAGMan
› Fault Tolerant Shell (FTSH)
› Hawkeye
› GCB
› MW
› NeST
› Stork
› Parrot
› VDT
› And others… all as open source
[Slide callout: Data!]
6
Who uses Condor?
› Commercial
  - Oracle, Micron, Hartford Life Insurance, CORE, Xerox, ExxonMobil, Shell, Alterra, Texas Instruments, …
› Research Community
  - Universities, Govt Labs
  - Bundles: NMI, VDT
  - Grid Communities: EGEE/LCG/gLite, Particle Physics Data Grid (PPDG), USCMS, LIGO, iVDGL, NSF Middleware Initiative GRIDS Center, …
7
Condor Pool
[Diagram labels: Schedd, Schedd, Jobs, MatchMaker, Startd]
9
Condor-G
[Diagram labels: Schedd (Condor-G, Condor-C), Jobs, Globus 2, Globus 4, Unicore, (Nordugrid), LSF, PBS, Startd, Schedd]
10
[Diagram labels: User/Application/Portal; Condor-G; Grid Middleware (Globus 2, Globus 4, Unicore, …); Condor Pool; Fabric (processing, storage, communication)]
11
Job Delegation
› Transfer of responsibility to schedule and execute a job
  - Stage in executable and data files
  - Transfer policy "instructions"
  - Securely transfer (and refresh?) credentials, obtain local identities
  - Monitor and present job progress (transparency!)
  - Return results
› Multiple delegations can be combined in interesting ways
12
Simple Job Delegation in Condor-G
[Diagram labels: Condor-G, Globus GRAM, Batch System Front-end, Execute Machine]
13
Expanding the Model
› What can we do with new forms of job delegation?
› Some ideas
  - Mirroring
  - Load-balancing
  - Glide-in schedd, startd
  - Multi-hop grid scheduling
14
Mirroring
› What it does
  - Jobs mirrored on two Condor-Gs
  - If primary Condor-G crashes, secondary one starts running jobs
  - On recovery, primary Condor-G gets job status from secondary one
› Removes Condor-G submit point as single point of failure
15
Mirroring Example
[Diagram labels: Jobs, Condor-G 1, Condor-G 2, Execute Machine]
17
Load-Balancing
› What it does
  - Front-end Condor-G distributes all jobs among several back-end Condor-Gs
  - Front-end Condor-G keeps updated job status
› Improves scalability
› Maintains single submit point for users
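The load-balancing idea above can be sketched in a few lines of Python. This is only an illustration: the `FrontEnd` and `BackEnd` classes, the "fewest jobs first" policy, and the job names are assumptions made for the example, not how Condor-G itself distributes jobs.

```python
from dataclasses import dataclass, field


@dataclass
class BackEnd:
    """Hypothetical stand-in for one back-end Condor-G."""
    name: str
    jobs: list = field(default_factory=list)


class FrontEnd:
    """Hypothetical front-end: the single submit point that tracks job status."""

    def __init__(self, backends):
        self.backends = backends
        self.status = {}  # job id -> (back-end name, last known state)

    def submit(self, job_id):
        # Assumed policy: hand the job to the back-end holding the fewest jobs.
        target = min(self.backends, key=lambda b: len(b.jobs))
        target.jobs.append(job_id)
        self.status[job_id] = (target.name, "idle")
        return target.name

    def update(self, job_id, state):
        # Called when a back-end reports a state change for a job.
        backend, _ = self.status[job_id]
        self.status[job_id] = (backend, state)


front = FrontEnd([BackEnd("backend1"), BackEnd("backend2"), BackEnd("backend3")])
placements = [front.submit(f"job{i}") for i in range(6)]
front.update("job0", "running")
```

Users only ever talk to `front`; its `status` table is what keeps the job view up to date even though the jobs live on different back-ends.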
18
Load-Balancing Example
[Diagram labels: Condor-G Front-end, Condor-G Back-end 1, Condor-G Back-end 2, Condor-G Back-end 3]
19
Glide-In
› Schedd and Startd are separate services that do not require any special privileges
  - Thus we can submit them as jobs!
› Glide-In Schedd
  - What it does
    - Drop a Condor-G onto the front-end machine of a remote cluster
    - Delegate jobs to the cluster through the glide-in schedd
    - Can apply cluster-specific policies to jobs
    - Not fork-and-forget…
  - Send a manager to the site, instead of managing across the Internet
20
Glide-In Schedd Example
[Diagram labels: Condor-G, Jobs, Middleware, Frontend, Glide-In Schedd, Batch System]
21
Glide-In Startd Example
[Diagram labels: Condor-G (Schedd), Middleware, Frontend, Batch System, Startd, Job]
22
Glide-In Startd
› Why?
  - Restores all the benefits that may have been washed away by the middleware
  - End-to-end management solution
    - Preserves job semantic guarantees
    - Preserves policy
  - Enables lazy planning
23
Sample Job Submit file
  universe = grid
  grid_type = gt2
  globusscheduler = cluster1.cs.wisc.edu/jobmanager-lsf
  executable = find_particle
  arguments = ….
  output = ….
  log = …
But we want metascheduling…
24
Represent grid clusters as ClassAds
› A ClassAd is a set of uniquely named expressions; each named expression is called an attribute (an attribute name/value pair)
  - combine query and data
  - extensible
  - semi-structured: no fixed schema (flexibility in an environment consisting of distributed administrative domains)
  - designed for "MatchMaking"
25
Example of a ClassAd that could represent a compute cluster in a grid:
  Type = "GridSite";
  Name = "FermiComputeCluster";
  Arch = "Intel-Linux";
  Gatekeeper_url = "globus.fnal.gov/lsf";
  Load = [ QueuedJobs = 42; RunningJobs = 200; ];
  Requirements = ( other.Type == "Job" && Load.QueuedJobs < 100 );
  GoodPeople = { "howard", "harry" };
  Rank = member(other.Owner, GoodPeople) * 500;
26
Another Sample - Job Submit
  universe = grid
  grid_type = gt2
  owner = howard
  executable = find_particle.$$(Arch)
  requirements = other.Arch == "Intel-Linux" || other.Arch == "Sparc-Solaris"
  rank = 0 - other.Load.QueuedJobs
  globusscheduler = $$(gatekeeper_url)
  …
Note: We introduced augmentation of the job ClassAd based upon information discovered in its matching resource ClassAd.
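To make the matching and the $$() augmentation concrete, here is a rough Python sketch. It is not the ClassAd library: the Requirements and Rank expressions from the two slides are hand-translated into Python functions, the second site (`HypotheticalCluster`, `gatekeeper.example.org/pbs`) is invented for contrast, and `match` and `expand` are illustrative helper names.

```python
import re


def member(item, items):
    """Stand-in for the ClassAd member() function used in the site's Rank."""
    return item in items


# First ad is the one from the slide; the second is a made-up site for contrast.
sites = [
    {"Type": "GridSite", "Name": "FermiComputeCluster", "Arch": "Intel-Linux",
     "Gatekeeper_url": "globus.fnal.gov/lsf",
     "Load": {"QueuedJobs": 42, "RunningJobs": 200},
     "GoodPeople": ["howard", "harry"]},
    {"Type": "GridSite", "Name": "HypotheticalCluster", "Arch": "Sparc-Solaris",
     "Gatekeeper_url": "gatekeeper.example.org/pbs",
     "Load": {"QueuedJobs": 90, "RunningJobs": 10},
     "GoodPeople": []},
]
job = {"Type": "Job", "Owner": "howard"}


def site_requirements(site, other):
    return other["Type"] == "Job" and site["Load"]["QueuedJobs"] < 100


def site_rank(site, other):
    return member(other["Owner"], site["GoodPeople"]) * 500


def job_requirements(job, other):
    return other["Arch"] in ("Intel-Linux", "Sparc-Solaris")


def job_rank(job, other):
    return 0 - other["Load"]["QueuedJobs"]


def match(job, sites):
    """Keep sites where both Requirements hold, then take the job's best Rank."""
    ok = [s for s in sites if site_requirements(s, job) and job_requirements(job, s)]
    return max(ok, key=lambda s: job_rank(job, s), default=None)


def expand(template, site):
    """Replace each $$(Attr) with the matched site's value (names compared case-insensitively)."""
    lookup = {k.lower(): v for k, v in site.items()}
    return re.sub(r"\$\$\((\w+)\)", lambda m: str(lookup[m.group(1).lower()]), template)


best = match(job, sites)
executable = expand("find_particle.$$(Arch)", best)
globusscheduler = expand("$$(gatekeeper_url)", best)
```

Both sites satisfy the requirements, but the job's rank prefers the shorter queue, so the job is augmented with the first site's architecture and gatekeeper.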
27
Multi-Hop Grid Scheduling
› Match a job to a Virtual Organization (VO), then to a resource within that VO
› Easier to schedule jobs across multiple VOs and grids
28
Multi-Hop Grid Scheduling Example
[Diagram labels: Experiment Condor-G, Experiment Resource Broker, VO Condor-G, VO Resource Broker, Globus GRAM, Batch Scheduler, HEPCMS]
29
Endless Possibilities
› These new models can be combined with each other or with other new models
› Resulting system can be arbitrarily sophisticated
30
Job Delegation Challenges
› New complexity introduces new issues and exacerbates existing ones
› A few…
  - Transparency
  - Representation
  - Scheduling Control
  - Active Job Control
  - Revocation
  - Error Handling and Debugging
31
Transparency
› Full information about job should be available to user
  - Information from full delegation path
  - No manual tracing across multiple machines
› Users need to know what's happening with their jobs
32
Representation
› Job state is a vector
› How best to show this to user
  - Summary
    - Current delegation endpoint
    - Job state at endpoint
  - Full information available if desired
    - Series of nested ClassAds?
33
Scheduling Control
› Avoid loops in delegation path
› Give user control of scheduling
  - Allow limiting of delegation path length?
  - Allow user to specify part or all of delegation path
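A minimal Python sketch of the two checks above, loop avoidance and an optional path-length limit. The `delegate` helper, the `max_len` parameter, and the scheduler names are hypothetical; the talk leaves the actual mechanism open.

```python
class DelegationError(Exception):
    """Raised when a delegation would violate the user's scheduling controls."""


def delegate(path, next_hop, max_len=None):
    """Return the delegation path extended by next_hop, or raise DelegationError.

    path is the list of schedulers that already hold responsibility for the job,
    starting with the user's submit point.
    """
    if next_hop in path:
        raise DelegationError(f"loop: {next_hop} is already in {path}")
    if max_len is not None and len(path) >= max_len:
        raise DelegationError(f"delegation path is limited to {max_len} entries")
    return path + [next_hop]


path = delegate(["experiment-condor-g"], "vo-condor-g")
path = delegate(path, "globus-gram", max_len=3)

try:
    delegate(path, "experiment-condor-g")  # would send the job back to its origin
    loop_rejected = False
except DelegationError:
    loop_rejected = True

try:
    delegate(path, "batch-scheduler", max_len=3)  # path already has 3 entries
    limit_rejected = False
except DelegationError:
    limit_rejected = True
```

Carrying the full path with the job is what makes both checks cheap; it is also the information the user would need in order to specify part of the path in advance.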
34
Active Job Control
› User may request certain actions
  - hold, suspend, vacate, checkpoint
› Actions cannot be completed synchronously for user
  - Must forward along delegation path
  - User checks completion later
35
Active Job Control (cont)
› Endpoint systems may not support actions
  - If possible, execute them at furthest point that does support them
› Allow user to apply action in middle of delegation path
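The "furthest point that supports the action" rule can be sketched as a walk back from the endpoint. The capability sets below are invented purely for illustration; they are not claims about what Globus GRAM or any batch system actually supports.

```python
def furthest_supporting(path, action):
    """Return the name of the hop closest to the endpoint that supports action.

    path is ordered from the submit point to the endpoint; each entry is
    (name, set of supported actions). Returns None if no hop supports it.
    """
    for name, supported in reversed(path):
        if action in supported:
            return name
    return None


# Hypothetical capabilities along one delegation path.
path = [
    ("Condor-G", {"hold", "suspend", "vacate", "checkpoint"}),
    ("Globus GRAM", {"hold"}),
    ("Batch System", {"suspend"}),
]

suspend_at = furthest_supporting(path, "suspend")
hold_at = furthest_supporting(path, "hold")
checkpoint_at = furthest_supporting(path, "checkpoint")
```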
36
Revocation
› Leases
  - Lease must be renewed periodically for delegation to remain valid
  - Allows revocation during long-term failures
› What are good values for lease lifetime and update interval?
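A lease of this kind can be modeled with two numbers, a lifetime and the time of the last renewal. The sketch below is an assumption-laden illustration: the `Lease` class and the 600-second lifetime are made up, and times are passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    """Hypothetical lease attached to a delegated job."""
    lifetime: float    # seconds a renewal stays valid
    renewed_at: float  # time of the most recent renewal

    def renew(self, now):
        self.renewed_at = now

    def is_valid(self, now):
        # The delegation stays valid only while renewals keep arriving.
        return now - self.renewed_at < self.lifetime


lease = Lease(lifetime=600, renewed_at=0)
lease.renew(now=300)                      # the delegator checks in after 5 minutes
valid_at_800 = lease.is_valid(now=800)    # 500 s since renewal: still valid
valid_at_900 = lease.is_valid(now=900)    # 600 s since renewal: expired
```

If the delegator disappears for longer than the lifetime, the remote side can treat the job as revoked without any further message, which is how leases cover long-term failures. The open question on the slide is the trade-off visible here: a short lifetime revokes quickly but needs frequent renewals.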
37
Error Handling and Debugging
› Many more places for things to go horribly wrong
› Need clear, simple error semantics
› Logs, logs, logs
  - Have them everywhere
38
From earlier
› Transfer of responsibility to schedule and execute a job
  - Transfer policy "instructions"
  - Stage in executable and data files
  - Securely transfer (and refresh?) credentials, obtain local identities
  - Monitor and present job progress (transparency!)
  - Return results
39
Job Failure Policy Expressions
› Condor/Condor-G augmented so users can supply job failure policy expressions in the submit file.
› Can be used to describe a successful run, or what to do in the face of failure.
  on_exit_remove =
  on_exit_hold =
  periodic_remove =
  periodic_hold =
40
Job Failure Policy Examples
› Do not remove from queue (i.e. reschedule) if the job exits with a signal:
  on_exit_remove = ExitBySignal == False
› Place on hold if the job exits with nonzero status or ran for less than an hour:
  on_exit_hold = ((ExitBySignal == False) && (ExitCode != 0)) || ((ServerStartTime - JobStartDate) < 3600)
› Place on hold if the job has spent more than 50% of its time suspended:
  periodic_hold = CumulativeSuspensionTime > (RemoteWallClockTime / 2.0)
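To see how these expressions decide a job's fate, the following Python sketch evaluates them against a dictionary of job attributes. It is a hand translation, not the ClassAd evaluator. Two assumptions are worth flagging: the nonzero-status test reads `ExitCode`, which is taken to be the attribute the slide intends, and `after_exit` checks the hold expression before the remove expression, which is an assumed ordering.

```python
def on_exit_remove(ad):
    # Leave the queue only if the job was not killed by a signal.
    return ad["ExitBySignal"] is False


def on_exit_hold(ad):
    nonzero_exit = (not ad["ExitBySignal"]) and ad["ExitCode"] != 0
    ran_under_an_hour = (ad["ServerStartTime"] - ad["JobStartDate"]) < 3600
    return nonzero_exit or ran_under_an_hour


def periodic_hold(ad):
    return ad["CumulativeSuspensionTime"] > ad["RemoteWallClockTime"] / 2.0


def after_exit(ad):
    """Assumed precedence for this sketch: hold first, then remove, else requeue."""
    if on_exit_hold(ad):
        return "hold"
    if on_exit_remove(ad):
        return "remove"
    return "requeue"


base = {"JobStartDate": 0, "ServerStartTime": 7200}  # a two-hour run
clean_exit = after_exit({**base, "ExitBySignal": False, "ExitCode": 0})
killed = after_exit({**base, "ExitBySignal": True, "ExitCode": 0})
failed = after_exit({**base, "ExitBySignal": False, "ExitCode": 1})
too_short = after_exit({"JobStartDate": 0, "ServerStartTime": 60,
                        "ExitBySignal": False, "ExitCode": 0})
mostly_suspended = periodic_hold({"CumulativeSuspensionTime": 4000,
                                  "RemoteWallClockTime": 7200})
```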
41
Data Placement* (DaP) must be an integral part of the end-to-end solution
* Space management and data transfer
42
Stork
› A scheduler for data placement activities in the Grid
› What Condor is for computational jobs, Stork is for data placement
› Stork comes with a new concept: "Make data placement a first class citizen in the Grid."
43
[Diagram: two pipelines. First: Stage-in, Execute the Job, Stage-out. Second (box labels): Allocate space for input & output data, Stage-in, Execute the job, Stage-out, Release input space, Release output space. Legend: Data Placement Jobs, Computational Jobs.]
44
DAG with DaP
DAG specification:
  DaP A A.submit
  DaP B B.submit
  Job C C.submit
  …..
  Parent A child B
  Parent B child C
  Parent C child D, E
  …..
[Diagram labels: DAGMan, Condor Job Queue (C), Stork Job Queue (E), DAG nodes A, B, C, D, E, F]
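The dependency lines in this DAG specification can be turned into a run order with a topological sort. The Python sketch below uses the standard-library `graphlib` module (Python 3.9+). The tiny parser and the `queue_for` routing table are assumptions for illustration; only nodes A, B, and C are routed, because the slide elides the types of the remaining nodes.

```python
from graphlib import TopologicalSorter

# Dependency lines as they appear on the slide.
dag_lines = ["Parent A child B", "Parent B child C", "Parent C child D, E"]

# Node types given on the slide: DaP = data placement job, Job = computational job.
node_types = {"A": "DaP", "B": "DaP", "C": "Job"}


def parse(lines):
    """Map each node to the set of parents that must finish before it runs."""
    preds = {}
    for line in lines:
        words = line.replace(",", " ").split()
        split_at = [w.lower() for w in words].index("child")
        parents, children = words[1:split_at], words[split_at + 1:]
        for parent in parents:
            preds.setdefault(parent, set())
        for child in children:
            preds.setdefault(child, set()).update(parents)
    return preds


def queue_for(node):
    # Assumed routing: data placement nodes go to Stork, the rest to Condor.
    return "Stork" if node_types[node] == "DaP" else "Condor"


order = list(TopologicalSorter(parse(dag_lines)).static_order())
routing = {node: queue_for(node) for node in node_types}
```

A, B, and C come out in that order because each depends on the previous one; D and E both depend only on C, so their relative order is not fixed.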
45
Why Stork?
› Stork understands the characteristics and semantics of data placement jobs.
› Can make smart scheduling decisions, for reliable and efficient data placement.
46
Failure Recovery and Efficient Resource Utilization
› Fault tolerance
  - Just submit a bunch of data placement jobs, and then go away…
› Control number of concurrent transfers from/to any storage system
  - Prevents overloading
› Space allocation and de-allocation
  - Make sure space is available
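Limiting concurrent transfers per storage system can be sketched as a small queueing scheduler. This is an illustration under assumptions, not Stork's scheduler: the `TransferScheduler` class, the per-host limit of 2, and the `example.org` URLs are all made up.

```python
from collections import Counter, deque
from urllib.parse import urlparse


class TransferScheduler:
    """Hypothetical scheduler that caps active transfers per storage host."""

    def __init__(self, max_per_host):
        self.max_per_host = max_per_host
        self.active = Counter()  # host -> number of running transfers
        self.waiting = deque()
        self.running = []

    @staticmethod
    def _hosts(transfer):
        src, dest = transfer
        return {urlparse(src).hostname, urlparse(dest).hostname}

    def submit(self, src, dest):
        self.waiting.append((src, dest))
        self._dispatch()

    def finish(self, transfer):
        self.running.remove(transfer)
        for host in self._hosts(transfer):
            self.active[host] -= 1
        self._dispatch()

    def _dispatch(self):
        # Start every waiting transfer whose hosts are all below the limit.
        still_waiting = deque()
        while self.waiting:
            transfer = self.waiting.popleft()
            hosts = self._hosts(transfer)
            if all(self.active[h] < self.max_per_host for h in hosts):
                for h in hosts:
                    self.active[h] += 1
                self.running.append(transfer)
            else:
                still_waiting.append(transfer)
        self.waiting = still_waiting


sched = TransferScheduler(max_per_host=2)
for i in range(3):
    sched.submit(f"gsiftp://src.example.org/data/f{i}", f"nest://dst.example.org/data/f{i}")
running_before = len(sched.running)
waiting_before = len(sched.waiting)
sched.finish(sched.running[0])
running_after = len(sched.running)
waiting_after = len(sched.waiting)
```

The third transfer waits until one of the first two finishes, so neither storage host ever sees more than two transfers at once.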
47
Support for Heterogeneity
Protocol translation using Stork memory buffer.
48
Support for Heterogeneity
Protocol translation using Stork Disk Cache.
49
Flexible Job Representation and Multilevel Policy Support
  [
    Type = "Transfer";
    Src_Url = "srb://ghidorac.sdsc.edu/kosart.condor/x.dat";
    Dest_Url = "nest://turkey.cs.wisc.edu/kosart/x.dat";
    ……
    Max_Retry = 10;
    Restart_in = "2 hours";
  ]
50
Run-time Adaptation
› Dynamic protocol selection
  [
    dap_type = "transfer";
    src_url = "drouter://slic04.sdsc.edu/tmp/test.dat";
    dest_url = "drouter://quest2.ncsa.uiuc.edu/tmp/test.dat";
    alt_protocols = "nest-nest, gsiftp-gsiftp";
  ]
  [
    dap_type = "transfer";
    src_url = "any://slic04.sdsc.edu/tmp/test.dat";
    dest_url = "any://quest2.ncsa.uiuc.edu/tmp/test.dat";
  ]
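One way to read the `alt_protocols` attribute is as an ordered fallback list of source-destination protocol pairs. The Python sketch below follows that reading, which is an assumption; `candidate_pairs`, `transfer`, and the `try_pair` callback are hypothetical names, and the callback only simulates a failing protocol.

```python
def candidate_pairs(request):
    """List (source protocol, destination protocol) pairs in the order to try them."""
    src_scheme = request["src_url"].split("://", 1)[0]
    dest_scheme = request["dest_url"].split("://", 1)[0]
    pairs = [(src_scheme, dest_scheme)]
    for item in request.get("alt_protocols", "").split(","):
        item = item.strip()
        if item:
            src_alt, dest_alt = item.split("-")
            pairs.append((src_alt, dest_alt))
    return pairs


def transfer(request, try_pair):
    """Try each pair in turn; return the first that succeeds, or None."""
    for pair in candidate_pairs(request):
        if try_pair(pair, request):
            return pair
    return None


request = {
    "dap_type": "transfer",
    "src_url": "drouter://slic04.sdsc.edu/tmp/test.dat",
    "dest_url": "drouter://quest2.ncsa.uiuc.edu/tmp/test.dat",
    "alt_protocols": "nest-nest, gsiftp-gsiftp",
}


def drouter_is_down(pair, request):
    # Simulated outcome: every protocol except drouter works.
    return "drouter" not in pair


used = transfer(request, drouter_is_down)
```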
51
Run-time Adaptation
› Run-time Protocol Auto-tuning
  [
    link = "slic04.sdsc.edu - quest2.ncsa.uiuc.edu";
    protocol = "gsiftp";
    bs = 1024KB; // block size
    tcp_bs = 1024KB; // TCP buffer size
    p = 4;
  ]
52
[Diagram labels: Planner, DAGMan, Condor-G, Stork, RFT, GRAM, SRM, StartD, SRB, NeST, GridFTP, Application, Parrot]
53
Thank You!
› Questions?