Bag-of-Tasks Scheduling under Budget Constraints
Ana-Maria Oprescu, Thilo Kielmann
Presented by Bryan Rosander
Bag-of-Tasks Background
Bag-of-tasks problems are common and computationally expensive
Condor provides a framework for "High-Throughput Computing"
  o Opportunistic batch processing: uses idle computing resources with little to no user intervention
  o Condor project started in 1988
  o Allowed both scientists and small businesses to take advantage of already purchased computing power
Cloud Computing?
Bag-of-tasks workloads are well suited to cloud computing
Pay-for-use computing power
  o Allows users to choose from several different classes of computing power at different price points
  o Makes computationally intensive tasks possible for those who can't afford data centers or supercomputers
Cloud services don't offer much guidance
Users may not know the computing cost characteristics of each job
Budget-Aware Cloud Batch Processing
Cloud computing makes high-throughput computing feasible for those without the resources to purchase the hardware
  o Allows users to choose from several different classes of computing power
BaTS is a "budget-constrained scheduler"
  o Capable of scheduling large bags of tasks
  o Can utilize multiple clouds with different characteristics
  o Does not need prior knowledge about tasks or completion times
  o Completes the tasks within the given budget, or terminates once that is determined to be infeasible
  o Attempts to minimize run-time without violating the budget constraint
BaTS - The Algorithm
Assumptions:
  o Tasks in a bag are independent of each other
  o Tasks can be preempted if necessary to reconfigure the cloud environment
  o There is some unknown distribution of execution times
  o We know the number of tasks to be executed
  o Machines belong to multiple categories, and machines within the same category are homogeneous
  o Category pricing is available and consistent
"Scheduling large bags of tasks onto multiple cloud platforms"
BaTS is run on a master machine (can be outside the cloud environment)
Sampling Phase
Sampling with replacement is done at a per-cluster level, using a subset of size n
n must be <= 0.05 * N in practice (the actual upper bound is given by a formula on the original slide)
Use a modified cumulative moving average of task execution times from the sample
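A minimal Python sketch of the sampling bookkeeping on this slide. It assumes N is the number of tasks in the bag, and it uses a plain cumulative moving average as a stand-in, since the slide does not say how the "modified" variant differs. The names `max_sample_size`, `draw_sample`, and `RunningMean` are hypothetical and do not come from BaTS.

```python
import random


def max_sample_size(total_tasks):
    """Practical cap from the slide: n <= 0.05 * N (integer math, at least 1)."""
    return max(1, (total_tasks * 5) // 100)


def draw_sample(task_ids, n, rng):
    """Pick n task ids with replacement, so the same id may be drawn twice."""
    return [rng.choice(task_ids) for _ in range(n)]


class RunningMean:
    """Plain cumulative moving average (assumption: stands in for the modified one)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def add(self, runtime_minutes):
        # CMA_k = CMA_{k-1} + (x_k - CMA_{k-1}) / k
        self.count += 1
        self.mean += (runtime_minutes - self.mean) / self.count
        return self.mean


if __name__ == "__main__":
    rng = random.Random(0)
    sample = draw_sample(list(range(1000)), max_sample_size(1000), rng)
    avg = RunningMean()
    for runtime in (10.0, 20.0, 30.0):  # made-up observed runtimes, in minutes
        avg.add(runtime)
    print(len(sample), avg.mean)
```

The incremental update avoids storing every observed runtime, which suits a master that refines its estimate as tasks complete.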
Goal: (formula on the original slide)
Dynamic programming: (formula on the original slide)
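The formulas for the goal and the dynamic-programming step did not survive on this slide. As an illustration only, the sketch below treats machine selection as a bounded knapsack: for every per-ATU spend it finds the fastest mix of machines, then picks the mix that finishes in the fewest ATUs without exceeding the budget. This is an assumed formulation, not necessarily the one used by BaTS; `best_rate_per_cost`, `plan`, and the category tuples are hypothetical.

```python
import math


def best_rate_per_cost(categories, max_cost):
    """Bounded-knapsack DP over integer per-ATU spend.

    categories: list of (cost_per_atu, tasks_per_atu, max_machines).
    Returns (rate, picks): rate[c] is the highest total tasks per ATU for a
    spend of at most c per ATU, picks[c] the machine counts that achieve it.
    """
    rate = [0.0] * (max_cost + 1)
    picks = [tuple(0 for _ in categories)] * (max_cost + 1)
    for idx, (cost, speed, limit) in enumerate(categories):
        for _ in range(limit):  # treat each available machine as a 0/1 item
            for c in range(max_cost, cost - 1, -1):
                candidate = rate[c - cost] + speed
                if candidate > rate[c]:
                    rate[c] = candidate
                    counts = list(picks[c - cost])
                    counts[idx] += 1
                    picks[c] = tuple(counts)
    return rate, picks


def plan(categories, n_tasks, budget):
    """Return (atus, total_cost, machine_counts) with the fewest ATUs, or None."""
    rate, picks = best_rate_per_cost(categories, budget)
    best = None
    for c in range(1, budget + 1):
        if rate[c] <= 0:
            continue
        atus = math.ceil(n_tasks / rate[c])
        spend = sum(n * cat[0] for n, cat in zip(picks[c], categories))
        total = atus * spend  # every started ATU is paid in full
        if total <= budget and (best is None or (atus, total) < best[:2]):
            best = (atus, total, picks[c])
    return best


if __name__ == "__main__":
    # Made-up categories: (cost per ATU, tasks per ATU, machines available)
    cats = [(3, 4.0, 2), (5, 8.0, 1)]
    print(plan(cats, n_tasks=24, budget=30))
    print(plan(cats, n_tasks=24, budget=15))
```

With a looser budget the sketch can afford a faster mix; a budget too small for any configuration yields `None`, mirroring the "terminate when infeasible" behavior described earlier.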
Rest of Run
Update the plan at regular intervals (at least 5 minutes apart)
Continually refine the estimated task completion time
Ensure that machines do not become underutilized
Machine cost is incurred at the start of an ATU, but jobs may not finish until its end
  o Must look at how many tasks will still be undone when each machine runs out of time
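The last point can be illustrated with a small Python sketch: given the minutes each machine has left in its already-paid ATU and the current estimate of the mean task time, count how many tasks would still be undone when the machines run out of paid time. The function names and the simple "whole tasks only" model are assumptions for illustration, not the BaTS formulas.

```python
def tasks_finishable(minutes_left_in_atu, est_task_minutes):
    """Whole tasks one machine can finish before its paid ATU ends."""
    return int(minutes_left_in_atu // est_task_minutes)


def leftover_tasks(remaining_tasks, machines_minutes_left, est_task_minutes):
    """Tasks still undone once every machine has used up its paid time."""
    can_finish = sum(
        tasks_finishable(minutes, est_task_minutes)
        for minutes in machines_minutes_left
    )
    return max(0, remaining_tasks - can_finish)


if __name__ == "__main__":
    # Three machines with 60, 40 and 10 minutes left; 15-minute tasks (made up)
    print(leftover_tasks(10, [60, 40, 10], 15.0))
```

A non-zero result signals that the current plan needs either another paid ATU on some machines or a different configuration at the next update.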
Use the previous cumulative average for time values, with a separate formula (on the original slide) for unfinished tasks
Use the formulas on the original slide to check whether the tasks should finish within the constraints
Testing
Emulated different types of clouds on the DAS-3 (Distributed ASCI Supercomputer 3) multi-cluster system
Requests for machines are handled realistically, with significant delay
Medium-size workload
  o 1000 tasks with normally distributed run times (mean 15 minutes, standard deviation √5 minutes)
  o Cluster 0 is $3 per machine per ATU; cluster 1 varies relative to cluster 0 as follows:
      S1-1: same price, same performance
      S1-4: same price, 4 times as fast
      S4-1: 4 times as expensive, same speed
      S3-4: 3 times as expensive, 4 times as fast
      S4-3: 4 times as expensive, 3 times as fast
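A short Python sketch of how the workload and cluster profiles listed above could be reproduced for experimentation. It reads each profile name as price multiple then speed multiple relative to cluster 0, as the descriptions on the slide suggest. The helper names and the clamping of run times to a small positive minimum are assumptions, not details of the original test setup.

```python
import math
import random

BASE_PRICE = 3  # dollars per machine per ATU for cluster 0 (from the slide)

# Cluster 1 profiles as (price multiple, speed multiple) relative to cluster 0.
PROFILES = {
    "S1-1": (1, 1),
    "S1-4": (1, 4),
    "S4-1": (4, 1),
    "S3-4": (3, 4),
    "S4-3": (4, 3),
}


def cluster1_price(profile):
    """Price per machine per ATU on cluster 1 for the given profile."""
    price_multiple, _ = PROFILES[profile]
    return BASE_PRICE * price_multiple


def runtime_on_cluster1(base_minutes, profile):
    """Run time of a task on cluster 1, given its run time on cluster 0."""
    _, speed_multiple = PROFILES[profile]
    return base_minutes / speed_multiple


def make_workload(n_tasks=1000, mean=15.0, std=math.sqrt(5), seed=0):
    """Normally distributed task run times in minutes, clamped to stay positive."""
    rng = random.Random(seed)
    return [max(0.1, rng.gauss(mean, std)) for _ in range(n_tasks)]


if __name__ == "__main__":
    workload = make_workload()
    print(len(workload), round(sum(workload) / len(workload), 2))
    for name in PROFILES:
        print(name, cluster1_price(name), runtime_on_cluster1(15.0, name))
```

Fixing the seed keeps runs repeatable, which helps when comparing schedulers on the same bag.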
Conclusions
BaTS winds up with slower execution times than RR (round robin) given the same budget (due to the sampling phase)
BaTS is capable of staying within budget when possible, or terminating early if not
When given smaller budgets, BaTS is cheaper but slower than RR
It would be helpful to find a way to suggest suitable budgets for tasks
The high complexity of the algorithm would make it prohibitive to drastically increase the number of machine classes or the number of possible workers