Scheduling Under Uncertainty: Planning for the Ubiquitous Grid. Neal Sample, Pedram Keyani, Gio Wiederhold. Stanford University
Coordination Why We’re Here Coding Integration/Composition
Sample Composition Tasks. Logistics: reservation and distribution systems, “find the best transportation route from A to B”. Genomics: framework for composing various processing tools and repositories. Modeling: weather prediction, complex chemical systems, basin modeling. Composition of services (vs. components, data).
Composition of Large Services. Remote, autonomous. Services are not free: fee (£), execution time. Open Service Model: GRID (principles); UDDI, IETF SLP (protocols); Globus, CPAM (runtime support).
Service Scheduling Goals. Closest to soft real-time, job shop. Objectives: minimize transaction time; minimize transaction cost. Differences: no control over service availability; no control over resource allocation; no control over workplace loads. => Schedules become inaccurate.
New Scheduling Requirements. Why not traditional scheduling (e.g., CSP)? Runtime performance changes; more than just scheduling: rescheduling in the face of runtime hazards. Why not traditional rescheduling? No resource allocation/control: “observe, not control”.
Scheduling Difficulties. Adaptation: schedules must be adaptive; schedules for t0 are only guesses; estimates for multiple stages may become invalid => schedules must be revised during runtime. Allocation: the scheduler does not handle resource allocation. Means: competing objectives have orthogonal scheduling techniques; changing goals for tasks or users means vastly increased scheduling complexity.
Sample Program
//sample program
BEGIN
  out1 = serviceA()
  out2 = serviceB(out1)
  out3 = serviceC(out2)
  out4 = serviceD(out2)
END
//declarative
[Figure: dependency graph of services A, B, C, D]
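The data dependencies in the sample program can be read as a small DAG. The following is a minimal Python sketch, not the CHAIMS representation: the dictionary layout and the helper name execution_stages are assumptions made for illustration. It groups the services into stages whose members could be invoked concurrently.

```python
from graphlib import TopologicalSorter

# Each service maps to the set of services whose output it consumes,
# read off the sample program: B needs A; C and D both need B.
dependencies = {
    "serviceA": set(),
    "serviceB": {"serviceA"},
    "serviceC": {"serviceB"},
    "serviceD": {"serviceB"},
}

def execution_stages(deps):
    """Group services into stages; services in one stage have no mutual dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        stages.append(ready)
        sorter.done(*ready)                 # mark the stage finished
    return stages

stages = execution_stages(dependencies)
print(stages)
```

serviceC and serviceD land in the same stage because both depend only on out2.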
Budgeting. Time: maximum allowable execution time. Expense: total resources available to lease services. Surety: schedule confidence; goal and assessment technique.
Program Schedule as a Template. Instantiated at runtime: service provider selection, etc. [Figure: the A, B, C, D template with several candidate providers per service]
Steps in Scheduling: Estimation, Planning, Invocation, Monitoring, Completion, Rescheduling.
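CHAIMS Scheduler. [Figure: architecture diagram. Components: Program Analyzer, Planner, Estimator/Bidder, Monitor, Dispatcher. Inputs: input program, requirements, budget. Internal flows: status, costs/times, control. Interactions with services: observe, invoke, haggle.]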
t0 Schedule Selection. Guided by runtime “bids”; constrained by budget. [Figure: the A, B, C, D template with candidate providers per service and sample bids: 7±2h £50, 6±1h £40, 5±2h £30, 3±1h £30]
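Instantiating the template under a budget can be sketched as an enumeration over provider bids. This is an illustrative Python sketch only: the provider names, the bid values, and the function instantiate are hypothetical, the ± uncertainty on each bid is ignored, and the A, then B, then C and D in parallel shape is taken from the sample program.

```python
from itertools import product

# Hypothetical bids per service: (provider, hours, cost in £).
bids = {
    "A": [("A1", 7, 50), ("A2", 6, 40)],
    "B": [("B1", 5, 30), ("B2", 3, 30)],
    "C": [("C1", 4, 20)],
    "D": [("D1", 2, 10), ("D2", 1, 25)],
}

def instantiate(bids, max_hours, max_cost):
    """Return every provider assignment that fits the time and cost bounds."""
    services = sorted(bids)
    feasible = []
    for combo in product(*(bids[s] for s in services)):
        chosen = dict(zip(services, combo))
        # A and B run in sequence; C and D run in parallel after B.
        hours = chosen["A"][1] + chosen["B"][1] + max(chosen["C"][1], chosen["D"][1])
        cost = sum(bid[2] for bid in combo)
        if hours <= max_hours and cost <= max_cost:
            providers = {s: bid[0] for s, bid in chosen.items()}
            feasible.append((providers, hours, cost))
    return feasible

feasible = instantiate(bids, max_hours=14, max_cost=110)
for providers, hours, cost in feasible:
    print(providers, hours, cost)
```

Exhaustive enumeration is only reasonable for small templates; it is used here to keep the sketch short.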
t0 Schedule Constraints
Budget. Time: upper bound, e.g. 22h. Cost: upper bound, e.g. £250. Surety: lower bound, e.g. 90%. {22, 250, 90}
Steered by user preferences/weights = Selection (single value convolution)
S1 est [20, 150, 90] = (22-20)*10 + (250-150)*1 + (90-90)*5 = 120
S2 est [22, 175, 95] = (22-22)*10 + (250-175)*1 + (95-90)*5 = 100
S3 est [18, 190, 96] = (22-18)*10 + (250-190)*1 + (96-90)*5 = 130
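The selection arithmetic above can be reproduced in a few lines of Python. This is a sketch under assumptions: the function name score is made up, and returning None for a schedule that violates a budget bound is an illustrative choice the slides do not specify.

```python
BUDGET = (22, 250, 90)   # time upper bound (h), cost upper bound (£), surety lower bound (%)
WEIGHTS = (10, 1, 5)     # user preference weights for time, cost, surety

def score(estimate, budget=BUDGET, weights=WEIGHTS):
    """Weighted sum of the margins between a schedule estimate and the budget."""
    time, cost, surety = estimate
    max_time, max_cost, min_surety = budget
    # Assumption: a schedule outside the budget is not scored at all.
    if time > max_time or cost > max_cost or surety < min_surety:
        return None
    w_time, w_cost, w_surety = weights
    return ((max_time - time) * w_time
            + (max_cost - cost) * w_cost
            + (surety - min_surety) * w_surety)

candidates = {"S1": (20, 150, 90), "S2": (22, 175, 95), "S3": (18, 190, 96)}
scores = {name: score(est) for name, est in candidates.items()}
best = max((n for n in scores if scores[n] is not None), key=scores.get)
print(scores, best)
```

With these weights S3 scores highest, since its time margin outweighs its higher cost.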
Program Evaluation and Review Technique (PERT). Service times: most likely (m), optimistic (a), and pessimistic (b); N(0, 1). [Equations not recovered: (1) expected duration (service); (2) standard deviation; (3) expected duration (program); (4) test value; (5) expectation test; (6) ~expectation test]
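For reference, the textbook PERT formulas matching labels (1) to (6) are given below. The slide's own equations did not survive extraction, so this is the standard formulation and not necessarily the exact form the authors used. The symbols D (deadline), s_min (minimum surety), sigma_T (program standard deviation), and Phi (standard normal CDF) are notation assumed here.

```latex
\begin{align}
t_e &= \frac{a + 4m + b}{6} \tag{1}\\
\sigma &= \frac{b - a}{6} \tag{2}\\
T_e &= \sum_{i \in \text{critical path}} t_{e,i} \tag{3}\\
z &= \frac{D - T_e}{\sigma_T},
  \qquad \sigma_T = \sqrt{\sum_{i \in \text{critical path}} \sigma_i^2} \tag{4}\\
\Phi(z) &\ge s_{\min} \tag{5}\\
\Phi(z) &< s_{\min} \tag{6}
\end{align}
```

Under this reading, (5) holds when the schedule meets its surety requirement and (6) flags a surety hazard.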
t0 Schedule Properties. [Figure: probability density versus probable completion time, annotated with the deadline, the surety, and Bank = £100]
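One way to put a number on surety is to treat the completion time as normally distributed, as standard PERT does, and take the probability of finishing before the deadline. The Python sketch below assumes a purely sequential chain of services. The (a, m, b) values and function names are hypothetical.

```python
import math

def pert_mean_std(a, m, b):
    """Standard PERT estimate from optimistic a, most likely m, pessimistic b."""
    return (a + 4 * m + b) / 6, (b - a) / 6

def surety(services, deadline):
    """Probability that a sequential chain of services finishes by the deadline."""
    estimates = [pert_mean_std(a, m, b) for a, m, b in services]
    total_mean = sum(mean for mean, _ in estimates)
    total_std = math.sqrt(sum(std * std for _, std in estimates))
    if total_std == 0:
        return 1.0 if total_mean <= deadline else 0.0
    z = (deadline - total_mean) / total_std
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical (a, m, b) service times in hours.
chain = [(2, 3, 4), (4, 5, 6), (5, 6, 7)]
print(round(surety(chain, deadline=15), 3))
```

A deadline equal to the expected duration gives a surety of 50%, which is why a 90% target needs slack in the schedule.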
Runtime Hazards. With resource allocation or without hazards, scheduling becomes trivial. Runtime implies t0 schedule invalidation. Sample hazards: delays and slowdowns; stoppages; inaccurate estimations; communication loss; competitive displacement… OSM
Definition + Detection: PROGRESSIVE HAZARD. [Figure: surety (%) versus execution time with the 90% minimum surety line; events: serviceA start, serviceB start, serviceB slow; the hazard is marked against the minimum surety line]
Definition + Detection: CATASTROPHIC HAZARD. [Figure: surety (%) versus execution time with the 90% minimum surety line and a 0% level; events: serviceA start, serviceB start, serviceB fails; the hazard is marked on the plot]
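The two hazard definitions can be sketched as a threshold check on observed surety. This Python sketch assumes that a catastrophic hazard shows up as surety falling to 0% and a progressive hazard as surety dropping below the minimum. The function names and sample values are hypothetical.

```python
def classify_hazard(surety_pct, min_surety_pct=90.0):
    """Return the hazard kind for one surety observation, or None if healthy."""
    if surety_pct <= 0.0:
        return "catastrophic"   # e.g. a service has failed outright
    if surety_pct < min_surety_pct:
        return "progressive"    # e.g. a service is running slow
    return None

def first_hazard(samples, min_surety_pct=90.0):
    """Scan (time, surety %) observations and report the first hazard found."""
    for t, surety_pct in samples:
        kind = classify_hazard(surety_pct, min_surety_pct)
        if kind is not None:
            return t, kind
    return None

slow_run = [(0, 95.0), (1, 93.0), (2, 88.0), (3, 80.0)]
failed_run = [(0, 95.0), (1, 0.0)]
print(first_hazard(slow_run), first_hazard(failed_run))
```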
Monitoring. Observe, not control. CPAM runtime support. Parameter presetting. ESTIMATE(…) primitive for service cost: used at t0 and t_reschedule. Service progress: EXAMINE(…) primitive, used with PERT to detect surety hazards.
Schedule Repair. Simple cost model: early termination = linear £ recovery. Greedy selection of single repair: O(s*r). [Figure: surety (%) versus execution time with the 90% line, marking t_hazard and t_repair on the A, B, C, D schedule]
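A greedy single-repair selection can be sketched as one pass over every (service, repair option) pair, which is where the O(s*r) bound comes from. The selection rule below is an assumption, since the slides do not spell it out: take the cheapest repair that restores the minimum surety and break ties on higher surety. All names, costs, and surety figures are hypothetical.

```python
from typing import NamedTuple, Optional

class Repair(NamedTuple):
    service: str
    strategy: str
    cost: float          # additional £ required
    surety_after: float  # projected surety (%) once the repair is applied

def greedy_repair(options_by_service, remaining_budget, min_surety=90.0) -> Optional[Repair]:
    """Pick one repair in a single pass over s services times r options."""
    best = None
    for service, options in options_by_service.items():
        for strategy, cost, surety_after in options:
            if cost > remaining_budget or surety_after < min_surety:
                continue  # unaffordable, or does not clear the hazard
            candidate = Repair(service, strategy, cost, surety_after)
            if best is None or (candidate.cost, -candidate.surety_after) < (best.cost, -best.surety_after):
                best = candidate
    return best

options = {
    "B": [("replace", 30, 93.0), ("duplicate", 40, 97.0), ("do nothing", 0, 70.0)],
    "C": [("pushdown", 0, 91.0)],
}
choice = greedy_repair(options, remaining_budget=100)
print(choice)
```

A ranking by surety gained per £ spent would be another plausible rule; it needs extra care for zero-cost repairs.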
Strategy 1: service replacement. Pro: minimize £ lost. Pro: boost surety. Con: lost investment of £ and time. Con: concedes recovery chance. [Figure: surety (%) versus execution time with the 90% line, marking t_hazard, t_repair, and B’ in the A, B, C, D schedule]
Strategy 2: service duplication. Pro: large boost to surety. Pro: leverages recovery chance. Con: large £ cost. [Figure: surety (%) versus execution time with the 90% line, marking t_hazard, t_repair, and B’ in the A, B, C, D schedule]
Strategy 3: pushdown repair. Pro: cheap, no £ lost. Pro: no time lost. Con: cannot handle all hazard types, e.g. catastrophic hazards. Con: requires recovery chance. [Figure: surety (%) versus execution time with the 90% line, marking t_hazard, t_repair, and C’ in the A, B, C, D schedule]
Strategy 4: do nothing/bail-out. Pro: no additional £ cost. Pro: ideal solution for partitioning hazards. Con: generally ineffective. Con: depends on self-recovery. [Figure: surety (%) versus execution time with the 90% line, marking t_hazard and t_repair on the A, B, C, D schedule]
Experimental Results. Rescheduling options: limit repair options to one strategy (limits flexibility and effectiveness), or use all strategies. Setup: 1000 random DAG schedules, 2-10 services; 1-3 hazards per execution; fixed service availability; all schedules are recoverable.
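A test set with the shape described in the setup could be generated along these lines. This Python sketch is not the authors' generator: the edge probability, the seed, and the function name random_dag are assumptions. Edges only run from lower-numbered to higher-numbered services, which guarantees the graph is acyclic.

```python
import random

def random_dag(rng, min_services=2, max_services=10, edge_prob=0.3):
    """Build a random dependency DAG as {service: set of prerequisite services}."""
    n = rng.randint(min_services, max_services)
    names = [f"s{i}" for i in range(n)]
    deps = {name: set() for name in names}
    for j in range(1, n):
        for i in range(j):
            # Only earlier services can be prerequisites, so no cycles arise.
            if rng.random() < edge_prob:
                deps[names[j]].add(names[i])
    return deps

rng = random.Random(0)  # fixed seed so runs are repeatable
schedules = [random_dag(rng) for _ in range(1000)]
hazard_counts = [rng.randint(1, 3) for _ in schedules]  # 1-3 hazards per execution
print(len(schedules), min(map(len, schedules)), max(map(len, schedules)))
```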
“The Numbers”. Value of close finishes? (!= 100% surety)
Why the Differences? Catastrophic hazard: service provider failure; cannot be solved by “do nothing”. Pseudo-hazard: communication failure, network partition; looks exactly like a catastrophic hazard; can’t terminate for £ recovery; appropriate solution is “do nothing”. Slowdown hazard (actual or apparent): not a complete failure, multiple solutions; “do nothing” may be ideal or futile.
A Fundamental Weakness. Observations of progress are only secondary indicators of current work rate. [Figure: projected finish versus finish time]
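The weakness can be illustrated by comparing two ways of projecting a finish time from progress reports. In this Python sketch (function names and numbers are hypothetical) the first projection uses the average rate since the start and the second uses only the rate between the last two observations. Neither observes the current work rate directly.

```python
import math

def projected_finish_average(start, now, progress):
    """Extrapolate using the average rate since the service started."""
    if not 0 < progress <= 1:
        raise ValueError("progress must be in (0, 1]")
    return start + (now - start) / progress

def projected_finish_recent(now, progress, prev_time, prev_progress):
    """Extrapolate using only the rate between the last two observations."""
    rate = (progress - prev_progress) / (now - prev_time)
    if rate <= 0:
        return math.inf  # no visible progress: stalled or unreachable
    return now + (1 - progress) / rate

# Service started at t=0, reported 50% at t=2h, then only 55% at t=3h.
average = projected_finish_average(start=0, now=3, progress=0.55)
recent = projected_finish_recent(now=3, progress=0.55, prev_time=2, prev_progress=0.5)
print(round(average, 2), round(recent, 2))
```

The average-rate projection still looks comfortable while the recent-rate projection shows the slowdown, so a monitor relying on cumulative progress alone reacts late.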
Open Questions. Mundane issues: taxonomy of hazard/solution combinations; vary service provider densities; monitor resolution adjustments. Networks are not free or zero latency: unstudied effect of delayed status information. Pseudo-hazards: what is a good amount of delay to avoid them (without getting into deeper trouble…)? Accuracy of t0 service cost estimates: ~hazard with delayed detection; 1-way hazard.
(Deeper) Open Questions. User preferences only used in generating initial (t0) schedule: fixed least cost repair ( = surety / repair cost); best cost repair (success sensitive to preference?). Second order cost effects: £ left over in budget is purchasing power; what is the value of that purchasing power? Sampling for cost estimates during runtime. Surety = time + progress (+ budget balance). Penalty regimes.
(Deeper) Open Questions. Simultaneous rescheduling: use more than one strategy for a hazard. NP: reduction to Hamiltonian Path. NP here might not be that hard… Approximations are acceptable; small set; strong constraints; NP is worst case, not average case…
(Deeper) Open Questions: completing the cost model. [Figure: service timeline from reservation through target start/run to finish, with early, on time, and late outcomes; annotations: client ready to start, hold fee, client ready for data, data transportation costs]
Conclusions. Initial results given artificial hazards: seemingly effective rescheduling strategies; difficult to characterize the solutions. Should translate well out of the sandbox and into an actual runtime. Clear directions for continued research. Project home