A Statistical Scheduling Technique for a Computational Market Economy
Neal Sample, Stanford University
Research Interests
- Compositional computing (GRID): reliability and quality of service; value-based and model-based mediation; languages: "programming for the non-programmer expert"
- Database research: semistructured indexing and storage; massive table/stream compression; approximate algorithms for streaming data
Why We're Here
- Coding
- Integration/Composition
GRID: Commodity Computing
- On demand (computer-in-the-loop)
- High throughput (FightAIDSAtHome, Nug30)
- Collaborative (data exploration, education)
- Distributed supercomputing (chip design, cryptography)
- Data intensive (Large Hadron Collider)
Composition of Large Services
- Remote, autonomous services
- Services are not free: fee ($), execution time
- 2nd-order dependencies
- "Open Service Model"
  - Principles: GRID, CHAIMS
  - Protocols: UDDI, IETF SLP
  - Runtime: Globus, CPAM
Grid Life is Tough: Increased Complexity Throughout
- New tools and applications
- Diverse resources: computers, storage media, networks, sensors
- Programming: control flow and data flow separation; service mediation
- Infrastructure: resource discovery, brokering, monitoring; security/authorization; payment mechanisms
Our GRID Contributions
- Programming models and tools
- System architecture
- Resource management
- Instrumentation and performance analysis
- Network protocols and infrastructure
- Service mediation
Other GRID Research Areas
- The nature of applications
- Algorithms and problem-solving methods
- Security, payment/escrow, reputation
- End systems
Roadmap
- Brief introduction to the CLAM language
- Some related scheduling methods
- Surety-based scheduling: sample program, monitoring, rescheduling
- Results
- A few future directions
CLAM Composition Language
- Decomposition of the CALL statement: parallelism via asynchrony in a sequential program; reduced complexity of invoke statements; control over new GRID requirements (estimation, trading, brokering, etc.)
- Abstracted data flow: mediation for data flow control and optimization; extraction model mediation
- Purely compositional: no primitives for arithmetic; no primitives for input/output
- Targets the "non-programmer expert"
CLAM Primitives
- Pre-invocation:
  - SETUP: set up the connection to a service
  - SETPARAM, GETPARAM: set/get parameters in a service
  - ESTIMATE: service cost estimation
- Invocation and result gathering:
  - INVOKE
  - EXAMINE: test the progress of an invoked method
  - EXTRACT: extract results from an invoked method
- Termination:
  - TERMINATE: terminate a method invocation or a connection to a service
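To make the lifecycle concrete, here is a minimal Python sketch of a client driving these primitives against one service. The ServiceClient class, its method names, and the simulated behavior are hypothetical illustrations of the primitive sequence, not the CLAM runtime API.

```python
import random
import time

class ServiceClient:
    """Toy stand-in for one remote service, mirroring the CLAM primitive
    lifecycle: SETUP, SETPARAM/GETPARAM, ESTIMATE, INVOKE, EXAMINE,
    EXTRACT, TERMINATE. All behavior here is simulated locally."""

    def __init__(self):
        self.params = {}
        self.started_at = None
        self.duration = None

    def setup(self, endpoint):
        # SETUP: establish the connection to the service.
        self.endpoint = endpoint

    def setparam(self, **kwargs):
        # SETPARAM: configure the pending invocation.
        self.params.update(kwargs)

    def estimate(self):
        # ESTIMATE: ask the service for a (time, fee) bid before buying.
        return {"hours": random.uniform(5, 9), "fee": 50}

    def invoke(self):
        # INVOKE: start the method asynchronously; control returns at once.
        self.started_at = time.time()
        self.duration = 0.2  # simulated run time in seconds

    def examine(self):
        # EXAMINE: poll progress of the running invocation (0.0 to 1.0).
        return min(1.0, (time.time() - self.started_at) / self.duration)

    def extract(self):
        # EXTRACT: pull results once the invocation has finished.
        return {"result": 42, "params": dict(self.params)}

    def terminate(self):
        # TERMINATE: tear down the invocation/connection.
        self.params.clear()

svc = ServiceClient()
svc.setup("opaque://some-provider")       # hypothetical endpoint
svc.setparam(dataset="d1", quality="hi")  # hypothetical parameters
print("bid:", svc.estimate())             # decide whether to buy
svc.invoke()                              # asynchronous: no blocking here
while svc.examine() < 1.0:                # EXAMINE enables monitoring
    time.sleep(0.05)
print("out:", svc.extract())
svc.terminate()
```

Note how INVOKE returns immediately: parallelism comes from asynchrony in an otherwise sequential program, and EXAMINE is what later makes surety monitoring possible.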
Resources + Scheduling: End System
- Computational model: multithreading; automatic parallelization
- Resource management: process creation; OS signal delivery; OS scheduling

Resources + Scheduling: Cluster (+ end system)
- Computational model: synchronous communication; distributed shared memory
- Resource management: parallel process creation; gang scheduling; OS-level signal propagation

Resources + Scheduling: Intranet (+ cluster, end system)
- Computational model: client/server; loosely synchronous pipelines; IWIM
- Resource management: resource discovery; signal distribution networks

Resources + Scheduling: Internet (+ intranet, cluster, end system)
- Computational model: collaborative systems; remote control; data mining
- Resource management: brokers; trading; mobile code negotiation
Scheduling Difficulties: Adaptation (Repair and Reschedule)
- Schedules made at t_0 are only guesses
- Estimates for multiple stages may become invalid
- => Schedules must be revised during runtime
[Timeline figure: schedule at t_0, work, hazard, reschedule, work, finish]
Scheduling Difficulties: Service Autonomy (No Resource Allocation)
- The scheduler does not handle resource allocation
- Users observe resources without controlling them
- This means: competing objectives require orthogonal scheduling techniques, and changing goals for tasks or users vastly increases scheduling complexity
Some Related Work
Legend: R = rescheduling, A = autonomy of services, M = monitoring execution, Q = QoS/probabilistic execution
- PERT: Q, A, M
- CPM: M, R, A
- ePERT (AT&T), Condor (Wisconsin): M, R, Q
- Mariposa (UCB): R, Q, A
- SBS (Stanford): R, Q, A, M
Sample Program
[Figure: a four-service DAG with nodes A, B, C, D]
Budgeting
- Time: maximum allowable execution time
- Expense: funding available to lease services
- Surety: goal probability that the schedule succeeds, plus an assessment technique
Program Schedule as a Template
- Instantiated at runtime: service provider selection, etc.
[Figure: the A-B-C-D DAG template, each node instantiated from several candidate service providers]
t_0 Schedule Selection
- Guided by runtime "bids"
- Constrained by budget
[Figure: candidate providers bidding on the DAG nodes, e.g., 7±2h/$50, 6±1h/$40, 5±2h/$30, 3±1h/$30]
t_0 Schedule Constraints
Budget:
- Time: upper bound, e.g., 22 h
- Cost: upper bound, e.g., $250
- Surety: lower bound, e.g., 90%
{Time, Cost, Surety} = {22, 250, 90}
Selection is steered by user preferences/weights, here {10, 1, 5}:
- S1_est [20, 150, 90] = (22-20)*10 + (250-150)*1 + (90-90)*5 = 120
- S2_est [22, 175, 95] = (22-22)*10 + (250-175)*1 + (95-90)*5 = 100
- S3_est [18, 190, 96] = (22-18)*10 + (250-190)*1 + (96-90)*5 = 130
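A minimal sketch of this weighted selection, using the bounds, weights, and candidate estimates from the slide; the assumption (not stated explicitly above) is that the highest weighted-slack score wins.

```python
# Weighted schedule selection over (time, cost, surety) estimates.
# Budget bounds and preference weights are the slide's example values.
BOUNDS = {"time": 22, "cost": 250, "surety": 90}   # h, $, %
WEIGHTS = {"time": 10, "cost": 1, "surety": 5}     # user preferences

def score(est):
    """Weighted slack: time and cost under their caps and surety over
    its floor all add to the score; a negative term flags a violated
    bound."""
    return ((BOUNDS["time"] - est["time"]) * WEIGHTS["time"]
            + (BOUNDS["cost"] - est["cost"]) * WEIGHTS["cost"]
            + (est["surety"] - BOUNDS["surety"]) * WEIGHTS["surety"])

candidates = {
    "S1": {"time": 20, "cost": 150, "surety": 90},
    "S2": {"time": 22, "cost": 175, "surety": 95},
    "S3": {"time": 18, "cost": 190, "surety": 96},
}
scores = {name: score(est) for name, est in candidates.items()}
print(scores)                       # {'S1': 120, 'S2': 100, 'S3': 130}
print(max(scores, key=scores.get))  # S3 under these weights
```

Shifting the weights shifts the winner: a user who values cost ten times more than time would prefer S1, which is exactly how user preferences steer the search over the Pareto space on the next slide.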
Pareto Search Space
[Figure: candidate plans plotted by expected program cost vs. expected program execution time; the budgeted time and cost bound the space, and user preferences select among the Pareto-optimal plans]
Program Evaluation and Review Technique (PERT)
Service times: most likely (m), optimistic (a), and pessimistic (b).
(1) expected duration (service): t_e = (a + 4m + b) / 6
(2) standard deviation: sigma = (b - a) / 6
(3) expected duration (program): T_E = sum of t_e over the schedule's critical path
(4) test value: z = (T_deadline - T_E) / sqrt(sum of sigma^2), with z ~ N(0, 1)
(5) expectation test: Phi(z), the probability of finishing by the deadline
(6) ~expectation test: 1 - Phi(z), the probability of missing it
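A runnable sketch of the PERT arithmetic above for a simple serial chain of services (a full implementation would sum over the critical path of the DAG). The three-service (a, m, b) triples are an assumed reading of the earlier "7±2h"-style bids, used purely as illustration.

```python
import math
from statistics import NormalDist

def pert_moments(a, m, b):
    """Per-service PERT moments from optimistic (a), most likely (m),
    and pessimistic (b) time estimates."""
    expected = (a + 4 * m + b) / 6
    sigma = (b - a) / 6
    return expected, sigma

def surety(services, deadline):
    """Probability the chain finishes by `deadline`, treating the total
    as Normal(sum of means, sum of variances) per PERT."""
    moments = [pert_moments(a, m, b) for (a, m, b) in services]
    total_mean = sum(mu for mu, _ in moments)
    total_sd = math.sqrt(sum(sd * sd for _, sd in moments))
    z = (deadline - total_mean) / total_sd  # the "test value"
    return NormalDist().cdf(z)              # the "expectation test"

# (a, m, b) per service; e.g., the 7±2h bid read as (5, 7, 9).
chain = [(5, 7, 9), (5, 6, 7), (3, 5, 7)]
print(f"surety by 22h: {surety(chain, deadline=22):.3f}")  # ~1.000
```

Re-running this computation as services report progress is what produces the surety curves on the following slides.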
t_0 Complete Schedule Properties
[Figure: probability density over probable program completion time, with the deadline, the remaining bank ($100), and the user-specified surety marked]
Individual Service Properties
[Figure: per-service finish-time probability densities for services C, A, B, with estimates 7±2h, 6±1h, 5±2h]
t_0 Combined Service Properties
[Figure: the combined program finish-time probability density against the 22h deadline; required surety 90%, current surety 99.6%]
Tracking Surety
[Figure: surety (%) tracked over execution against the user-specified surety threshold]
Runtime Hazards
- With control over resource allocation, or without runtime hazards, scheduling becomes much easier
- At runtime, hazards invalidate the t_0 schedule
- Sample hazards in the Open Service Model: delays and slowdowns; stoppages; inaccurate estimations; communication loss; competitive displacement…
Progressive Hazard: Definition + Detection
[Figure: surety (%) over execution time; serviceA starts, then serviceB starts but runs slow, and surety decays gradually below the 90% minimum]

Catastrophic Hazard: Definition + Detection
[Figure: serviceB fails outright and surety drops immediately to 0%]

Pseudo-Hazard: Definition + Detection
[Figure: a serviceB communication failure also shows surety at 0%, indistinguishable at detection time from a catastrophic hazard]
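The three hazard classes differ only in how measured surety moves, so a small illustrative classifier over a window of surety samples captures the distinction. The thresholds and window abstraction are assumptions for illustration; note that a true pseudo-hazard looks identical to a catastrophic one at detection time, so the code can only flag "catastrophic-or-pseudo".

```python
def classify_hazard(surety_window, minimum=0.90):
    """Classify a hazard from recent surety samples (each 0.0-1.0).

    - progressive: surety decays across the window below the minimum
    - catastrophic_or_pseudo: surety collapses to ~0; a provider death
      and a mere communication loss look identical from the outside
    - none: surety still healthy
    """
    current = surety_window[-1]
    if current <= 0.01:
        # Total collapse: either the provider failed (catastrophic) or
        # we only lost contact with it (pseudo-hazard).
        return "catastrophic_or_pseudo"
    if current < minimum and surety_window[0] > current:
        # Gradual decay below the user's floor: a slow service.
        return "progressive"
    return "none"

print(classify_hazard([0.97, 0.93, 0.85]))  # progressive
print(classify_hazard([0.96, 0.95, 0.0]))   # catastrophic_or_pseudo
print(classify_hazard([0.95, 0.96, 0.96]))  # none
```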
Monitoring + Repair
- Observe, not control
- Complete set of repairs: sufficient, not minimal
- Simple cost model: early termination = linear cost recovery
- Greedy selection of a single repair: O(s·r) over s services and r repair strategies (see the sketch after the strategy slides below)
Schedule Repair
[Figure: surety (%) over execution time; a hazard at t_hazard drops surety below the 90% floor, and a repair applied at t_repair restores it]
Strategy 0: baseline (no repair)
- pro: no additional $ cost
- pro: the ideal solution for partitioning hazards
- con: depends on self-recovery

Strategy 1: service replacement
- pro: reduces $ lost
- con: loses the $ and time already invested
- con: concedes the chance of recovery

Strategy 2: service duplication
- pro: larger surety boost; leverages the chance of recovery
- con: large $ cost

Strategy 3: pushdown repair
- pro: cheap, no $ lost
- pro: no time lost
- con: cannot handle catastrophic hazards
- con: requires a chance of recovery
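A minimal sketch of the greedy single-repair selection from the Monitoring + Repair slide: score every (service, strategy) pair once, which is the O(s·r) scan, and apply the best affordable repair. The scoring rule, the surety-gain and cost numbers, and the exchange rate between surety and dollars are illustrative assumptions, not the talk's cost model.

```python
# Greedy O(s*r) repair selection: for each affected service, evaluate
# each repair strategy once and pick the single best-scoring repair.
STRATEGIES = {
    # strategy: (assumed surety gain, assumed $ cost) -- illustrative
    "baseline":    (0.02, 0),    # do nothing, hope for self-recovery
    "replacement": (0.05, 40),   # terminate and re-lease elsewhere
    "duplication": (0.08, 60),   # run a second provider in parallel
    "pushdown":    (0.04, 0),    # give a downstream service more slack
}

def pick_repair(services, budget_left, weight=1000):
    """Return the (service, strategy) pair maximizing surety gain net
    of cost. One pass over s services x r strategies; `weight` is an
    assumed exchange rate turning surety gain into dollars."""
    best, best_score = None, float("-inf")
    for svc in services:
        for name, (gain, cost) in STRATEGIES.items():
            if cost > budget_left:
                continue  # cannot afford this repair
            score = gain * weight - cost
            if score > best_score:
                best, best_score = (svc, name), score
    return best

print(pick_repair(["B"], budget_left=100))  # ('B', 'pushdown')
```

With these illustrative numbers the free pushdown wins; against a catastrophic hazard its gain would be zero and the scan would fall through to replacement or duplication.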
Experimental Results
- Rescheduling options:
  - Baseline: no repairs
  - Single-strategy repairs (limits flexibility and effectiveness)
  - Use all strategies
- Setup:
  - 1000 random DAG schedules, 2-10 services
  - 1-3 hazards per execution
  - Fixed service availability
  - All schedules are repairable
"The Numbers"
What is the value of a close finish?
[Results chart lost in extraction; only a "late" label survives]
Why the Differences?
- Catastrophic hazard: a service provider failure; "do nothing" is no solution to the hazard
- Pseudo-hazard: a communication failure or network partition; looks exactly like a catastrophic hazard, yet "do nothing" is the ideal solution
- Slowdown hazard: not a complete failure, so multiple solutions apply; "do nothing" may be ideal, futile, or merely acceptable
A Challenge
Observations of progress are only secondary indicators of the current work rate.
[Figure: identical progress observations are consistent with different work rates, yielding different projected finish times]
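To illustrate the challenge: a sketch in which two work-rate histories produce the same observed progress at the sampling instant but very different real finishes. The linear-extrapolation projector is an assumption chosen for simplicity.

```python
def projected_finish(now, progress, rate_estimate):
    """Naive projection: remaining work divided by the estimated rate."""
    return now + (1.0 - progress) / rate_estimate

# Both services show 50% progress at hour 5...
obs_now, obs_progress = 5.0, 0.5

# ...so extrapolating from the average observed rate (0.1/h) gives the
# same projected finish for both:
avg_rate = obs_progress / obs_now
print(projected_finish(obs_now, obs_progress, avg_rate))  # 10.0h

# But if one service is speeding up and the other is stalling, the real
# finishes diverge even though every observation so far was identical:
print(projected_finish(obs_now, obs_progress, 0.2))   # 7.5h
print(projected_finish(obs_now, obs_progress, 0.02))  # 30.0h
```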
Open Questions
- Simultaneous rescheduling: use more than one strategy for a hazard
  - Finding the optimal combination is NP, but NP here might not be that hard: approximations are acceptable, the set is small, and the constraints are strong; NP is the worst case, not necessarily the average case (e.g., depth-first branch-and-bound search)
- Global impact of local schedule preferences: how do local preferences interact in, and reshape, the global market?
Open Questions
- Monitoring resolution adjustments: networks are not free or zero-latency, so account for the cost of monitoring; more frequent monitoring costs more but yields greater accuracy; the effect of delayed status information is unstudied
- Accuracy of t_0 service cost estimates: model a bad estimate as a hazard with delayed detection (a "1-way hazard"); penalty adjustments
Deeper Questions
- User preferences are only used in generating the initial (t_0) schedule; repair uses a fixed least-cost rule (= surety / repair cost). Would a best-cost repair make success sensitive to the preferences?
- Second-order cost effects: $ left over in the budget is purchasing power; what is the value of that purchasing power? Sampling for cost estimates during runtime; surety = time + progress (+ budgetBalance/valuation)
Conclusions
- A novel statistical method for service scheduling
- Effective strategies for a varied hazard mix
- Achieves per-user-defined quality of service
- Should translate well "out of the sandbox"
- Clear directions for continued research
More information
Steps in Scheduling
- Estimation
- Planning
- Invocation
- Monitoring
- Completion
- Rescheduling
CHAIMS Scheduler
[Architecture figure. Components: Program Analyzer, Planner, Estimator/Bidder, Dispatcher, Monitor. Inputs: the program (to the Program Analyzer) and user requirements, e.g., budget (to the Planner). Flows: the Estimator/Bidder haggles over costs/times, the Dispatcher invokes services, and the Monitor observes status and returns control]
Completing the Cost Model: Simplified
[Figure: a single timeline with an on-time target over start/run and finish] + data transportation costs
Completing the Cost Model: Full
[Figure: timeline from reservation through start/run to finish; a hold fee accrues between "client ready to start" and the actual start, and finishing early, on time (target), or late is judged against "client ready for data"] + data transportation costs
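A toy sketch of how such a full cost computation might look. Every fee name, rate, and penalty shape here is an assumption for illustration (linear hold fees and lateness penalties), not the model from the slides.

```python
def full_cost(base_fee, hold_hours, finish, target,
              hold_rate=2.0, late_rate=10.0, early_rate=1.0,
              data_transport=5.0):
    """Illustrative full cost of one service lease (rates in $/h, all
    shapes assumed linear):
      - base_fee: the quoted fee for start/run through finish
      - hold fee: accrues while the client is ready but the service
        has not yet started
      - early/late penalty: deviation of the finish from the target,
        i.e., from when the client is ready for the data
      - data_transport: cost of moving inputs and outputs
    """
    hold_fee = hold_rate * hold_hours
    if finish > target:
        timing_penalty = late_rate * (finish - target)   # late: steep
    else:
        timing_penalty = early_rate * (target - finish)  # early: mild
    return base_fee + hold_fee + timing_penalty + data_transport

# e.g., a $50 lease held 1h, finishing 2h late against a 22h target:
print(full_cost(50, hold_hours=1, finish=24, target=22))  # 77.0
```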
The Eight Fallacies of Distributed Computing (Peter Deutsch)
1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn't change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous