1 Task Assignment with Unknown Duration Mor Harchol-Balter Carnegie Mellon
2 Distributed Server. The load balancer (L.B.) employs a TAP (Task Assignment Policy): a rule for assigning jobs to hosts. Age-old question: What’s a good TAP? [Figure labels: L.B., large # of jobs]
3 The Model. Processing requirement (size) of a job is not known. Jobs are not preemptible. Jobs queued at a host are processed in FCFS order. Hosts are identical. Motivation for model: distributed servers for supercomputing, where each host is a multi-processor. [Figure labels: L.B., large # of jobs, FCFS]
4 1. Round-Robin. 2. Random. 3. Shortest-Queue: send job to host with the fewest jobs. 4. Least-Work-Left: send job to host with least total work left / Central-Queue: host grabs next job when free. Which TAP is best (given model)? “Best” means minimizing mean waiting time. [Figure label: L.B.]
5 1. Round-Robin. 2. Random. 3. Shortest-Queue: send job to host with the fewest jobs. 4. Least-Work-Left: send job to host with least total work left / Central-Queue: host grabs next job when free. Which TAP is best (given model)? “Best” means minimizing mean waiting time. Known: optimal for exponentially distributed sizes. [Figure label: L.B.]
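The four dispatch rules listed above can be sketched as small Python functions. This is only an illustrative sketch: the function names, the 0-based host indices, and the tie-breaking rule (lowest index wins) are assumptions, not part of the talk. Note that `least_work_left` is handed the remaining work per host directly, which the model says is not known in advance; Central-Queue achieves the same effect without that knowledge by letting a free host pull the next job.

```python
import random

def round_robin(job_index, num_hosts):
    """Job i goes to host i mod h (hosts are numbered from 0 here)."""
    return job_index % num_hosts

def random_host(num_hosts, rng):
    """Each host is chosen with equal probability 1/h."""
    return rng.randrange(num_hosts)

def shortest_queue(queue_lengths):
    """Host with the fewest jobs; ties go to the lowest index (assumption)."""
    return min(range(len(queue_lengths)), key=lambda i: queue_lengths[i])

def least_work_left(work_left):
    """Host with the least total unfinished work; ties go to the lowest index."""
    return min(range(len(work_left)), key=lambda i: work_left[i])

# Hypothetical snapshot of a 3-host system.
queue_lengths = [3, 1, 2]      # number of jobs at each host
work_left = [4.0, 9.5, 6.0]    # seconds of unfinished work at each host

print(round_robin(5, 3))               # 2
print(shortest_queue(queue_lengths))   # 1
print(least_work_left(work_left))      # 0
```

In this made-up snapshot Shortest-Queue and Least-Work-Left disagree: host 1 holds a single job, but that job is large.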
6 But real jobs do NOT have exponentially-distributed sizes! They have heavy-tailed sizes.
7 Unix process CPU lifetime measurements [Harchol-Balter, Downey TOCS 97]. We measured over 1 million UNIX processes on instructional, research, and sys. admin. machines. A job of CPU age x has probability 1/2 of using another x. Measured distribution: $\Pr\{\text{Size} > x\} = 1/x$. [Figure: log-log plot of the fraction of jobs with CPU duration > x against duration x (secs)]
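The "probability 1/2 of using another x" statement follows directly from the measured tail $\Pr\{\text{Size} > x\} = 1/x$. A short derivation, assuming $x \ge 1$ so that $1/x$ is a valid probability:

```latex
\[
\Pr\{\text{Size} > 2x \mid \text{Size} > x\}
  = \frac{\Pr\{\text{Size} > 2x\}}{\Pr\{\text{Size} > x\}}
  = \frac{1/(2x)}{1/x}
  = \frac{1}{2}
\]
```

So a job that has already used x seconds of CPU uses at least another x seconds with probability 1/2, whatever its current age.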
8 Bounded Pareto (heavy-tailed) distribution: $\Pr\{\text{Size} > x\} \sim x^{-\alpha}$, $0 < \alpha < 2$, with job sizes bounded between a minimum (min) and a maximum (max). $\alpha$: degree of variability. $\alpha$ near 0: more variable and more heavy-tailed; $\alpha$ near 2: less variable and less heavy-tailed. Properties: decreasing failure rate; very high variance; heavy-tail property: a minuscule fraction (<1%) of the very largest jobs make up half the load.
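To get a feel for the heavy-tail property, the sketch below computes what fraction of the total load comes from the largest 1% of jobs. It assumes the standard Bounded Pareto density $f(x) = \alpha k^{\alpha} x^{-\alpha-1} / (1 - (k/p)^{\alpha})$ on $[k, p]$; the symbols `k` (min), `p` (max) and the parameter values are illustrative assumptions, not numbers from the talk.

```python
def bp_tail_quantile(tail, k, p, alpha):
    """Return x such that Pr{Size > x} = tail for a Bounded Pareto on [k, p]."""
    r = (k / p) ** alpha
    return k / (tail * (1.0 - r) + r) ** (1.0 / alpha)

def bp_partial_mean(a, b, k, p, alpha):
    """Integral of x * f(x) over [a, b]; closed form valid for alpha != 1."""
    c = k ** alpha / (1.0 - (k / p) ** alpha)
    return c * alpha / (1.0 - alpha) * (b ** (1.0 - alpha) - a ** (1.0 - alpha))

def load_fraction_of_largest(tail, k, p, alpha):
    """Fraction of total work contributed by the largest `tail` fraction of jobs."""
    x = bp_tail_quantile(tail, k, p, alpha)
    return bp_partial_mean(x, p, k, p, alpha) / bp_partial_mean(k, p, k, p, alpha)

# Assumed parameters: min size 1, max size 1e10, two values of alpha.
for alpha in (1.1, 1.9):
    frac = load_fraction_of_largest(0.01, k=1.0, p=1e10, alpha=alpha)
    print(f"alpha={alpha}: largest 1% of jobs carry {frac:.2f} of the load")
```

With these assumed parameters the smaller $\alpha$ concentrates well over half the load in the top 1% of jobs, while the larger $\alpha$ concentrates much less.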
9 1. Round-Robin. 2. Random. 3. Shortest-Queue: send job to host with the fewest jobs. 4. Least-Work-Left: send job to host with least total work left / Central-Queue: host grabs next job when free. Which TAP is best for heavy-tailed job sizes? “Best” means minimizing mean waiting time. Known: optimal for exponentially distributed sizes. [Figure label: L.B.]
10 The TAGS algorithm: “Task Assignment by Guessing Size”. When a job at host j reaches size $s_j$, the job is killed and restarted from scratch at host j+1. [Figure: outside arrivals enter at Host 1; jobs move on to Host 2, Host 3, and Host 4 at cutoffs $s_1$, $s_2$, $s_3$]
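The kill-and-restart rule can be traced for a single job with a few lines of Python. This is a minimal sketch: the function name, the cutoff values, and the convention that a job of size exactly $s_j$ still completes at host j are assumptions made for illustration. Queueing delays are ignored; only the work performed is counted.

```python
def tags_run(job_size, cutoffs):
    """Trace one job through TAGS.

    cutoffs = [s_1, s_2, ...]; the last host has no cutoff.
    Returns (host where the job finishes, numbered from 1,
             total work performed including killed attempts).
    """
    wasted = 0.0
    for host, cutoff in enumerate(cutoffs, start=1):
        if job_size <= cutoff:
            return host, wasted + job_size
        # The job ran for `cutoff` units here, was killed, and restarts from scratch.
        wasted += cutoff
    return len(cutoffs) + 1, wasted + job_size

cutoffs = [1, 10, 100]           # hypothetical s_1, s_2, s_3 for a 4-host system
print(tags_run(0.5, cutoffs))    # (1, 0.5)
print(tags_run(50, cutoffs))     # (3, 61.0)
print(tags_run(500, cutoffs))    # (4, 611.0)
```

The size-50 job performs 61 units of work in total: 1 at Host 1 and 10 at Host 2 are thrown away before it completes at Host 3.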
11 3 Flavors of TAGS. How to choose the cutoffs $s_1, s_2, s_3$, …: TAGS-opt-meanslowdown, TAGS-opt-meanwaitingtime, TAGS-opt-fairness.
12 TAGS is counterintuitive. TAGS wastes resources: it is not work-conserving. Big jobs seem unfairly penalized, yet TAGS somehow turns out to be fair? TAGS always operates under unbalanced load.
13 Results of Analysis: 2 hosts only, system load = 0.5. [Figure: two panels, one comparing Random, Least-Work-Left, and TAGS-opt-fairness, the other comparing Random, Least-Work-Left, and TAGS-opt-slowdown]
14 Results of Analysis: 2 hosts only, system load = 0.5. [Figure: comparison of Random, Least-Work-Left, and TAGS-opt-waitingtime]
15 More Results: 4 hosts, system load = 0.3. [Figure: comparison of Random, Least-Work-Left, and TAGS]
16 More Results. New metric: server expansion = the number of hosts we would have to add to the system to get mean slowdown down to 2 or 3. (Initial system: 2 hosts, system load = 0.7.) [Figure: comparison of TAGS and Least-Work-Left]
17 WHY does TAGS work so well? 1) Reduction of variance of job size distribution 2) Load Unbalancing
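18 WHY does TAGS work so well? 1) Reduction of variance of job size distribution: TAGS reduces the variance of the job size distribution at the hosts. No other policy does this! Recall the P-K formula for the mean waiting time in an M/G/1 FCFS queue: $E\{W\} = \dfrac{\lambda E\{X^2\}}{2(1-\rho)}$, where $E\{X^2\}$ is the second moment of the job size distribution, $\lambda$ is the arrival rate, and $\rho = \lambda E\{X\}$ is the load.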
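The P-K formula, $E\{W\} = \lambda E\{X^2\} / (2(1-\rho))$ with $\rho = \lambda E\{X\}$, makes the role of the second moment easy to check numerically. The sketch below compares two job size distributions with the same mean; the function name, the arrival rate, and the moments are illustrative assumptions.

```python
def pk_mean_wait(arrival_rate, mean_size, second_moment):
    """Mean waiting time in an M/G/1 FCFS queue (Pollaczek-Khinchine formula)."""
    rho = arrival_rate * mean_size          # load at the host
    if not 0.0 <= rho < 1.0:
        raise ValueError("load must satisfy 0 <= rho < 1 for a stable queue")
    return arrival_rate * second_moment / (2.0 * (1.0 - rho))

lam = 0.5   # assumed arrival rate, giving load 0.5 for mean size 1

# Exponential sizes with mean 1 have second moment 2.
print(pk_mean_wait(lam, 1.0, 2.0))   # 1.0

# Deterministic sizes equal to 1 have second moment 1.
print(pk_mean_wait(lam, 1.0, 1.0))   # 0.5
```

At the same load, halving the second moment halves the mean waiting time, which is why shrinking the variance of the sizes seen by each host matters so much.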
19 WHY does TAGS work so well? 2) Load Unbalancing: All other policies aim to balance the load. TAGS unbalances load. [Figure: two panels, TAGS-opt-slowdown and TAGS-opt-fairness, with curves labeled Host 1 and Host 2] This is fair? YES
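A toy calculation shows how a single TAGS cutoff splits work unevenly between two hosts. The job sizes and the cutoff below are invented for illustration, and the sketch counts only work performed, not queueing.

```python
def tags_two_host_work(sizes, cutoff):
    """Total work done at Host 1 and Host 2 under 2-host TAGS.

    Host 1 serves every job for at most `cutoff` units; jobs larger than
    the cutoff are killed there and rerun in full at Host 2.
    """
    host1 = sum(min(x, cutoff) for x in sizes)
    host2 = sum(x for x in sizes if x > cutoff)
    return host1, host2

sizes = [1] * 9 + [91]    # hypothetical workload: nine small jobs, one huge job
host1, host2 = tags_two_host_work(sizes, cutoff=2)
print(host1, host2)       # 11 91
```

Host 1 sees ten jobs but very little work, while Host 2 sees one job carrying most of the work; moving the cutoff changes that split.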
20 Conclusion This research challenges our common wisdom: Load unbalancing may be better than load balancing. It may be worthwhile to waste resources by restarting a job from scratch at a new machine … even if the new machine has a much higher load than the original machine! A policy which appears to greatly penalize large jobs may actually be fair.