1
Omega: flexible, scalable schedulers for large compute clusters
April 17th, 2013 Omega: flexible, scalable schedulers for large compute clusters Malte Schwarzkopf (University of Cambridge Computer Lab) Andy Konwinski (UC Berkeley) Michael Abd-El-Malek (Google) John Wilkes (Google) Hello everyone! I'll be talking about scheduling in Omega today. Omega is Google's next-generation cluster management system, and Andy and I were privileged to make a small contribution to it while interning with Mike and John at Google. 1 EuroSys 2013
2
the scheduling problem
Job Tasks Machines Now, Omega is a complex system with many aspects to it. Today, I'll focus on scheduling, that is, the problem of mapping work to resources. In a shared compute cluster, this means mapping tasks (shown as circles here), which are part of a job, to machines. The example job has two tasks here, but in reality, this varies between one and many thousands of tasks. 2 EuroSys 2013
3
Increasing cluster sizes
trends observed Diverse workloads Increasing cluster sizes At Google, we have observed a few trends over the last years of running the old cluster management system. First of all, workloads are very diverse in shapes, sizes and requirements. This makes the cluster scheduler's job a lot more difficult, as we'll see. <CLICK> At the same time, cluster sizes keep increasing, <CLICK> and so does the number of jobs arriving, as more and more applications are added. Growing job arrival rates 3 EuroSys 2013
4
scheduling logic why is this a problem?
Arriving jobs and tasks (1,000s) 60+ seconds! scheduling logic Cluster scheduler Cluster machines (10,000s) So how does the typical cluster scheduler work? Well, it tracks the state of tens of thousands of machines, shown as squares here. Many jobs from all kinds of different workloads arrive all the time, and proceed through scheduling logic in the cluster scheduler to be assigned to machines.<CLICK> This scheduling logic could be the same for everyone, <CLICK>but is much more likely to become quite heterogeneous over time, with the scheduler adding special features for different job types. For example, some important service jobs require careful placement for fault tolerance and optimal resource use. <CLICK> This can take a long time: for example, using constraint solvers or Monte Carlo simulations means that scheduling a job may take many seconds or even a minute. 4 EuroSys 2013
5
Increasing complexity!
why is this a problem? Arriving jobs and tasks (1,000s) Increasing complexity! scheduling logic Cluster scheduler Cluster machines (10,000s) Hence: Break up into independent schedulers. This feature creep, however, leads to an increasingly complex scheduler implementation. <CLICK> We've been through this with the current Google cluster scheduler. Over the years, many features, short-cuts and extensions were requested and implemented for specific workloads. This significantly increases the engineering and maintenance complexity for the scheduler. There is a solution, though: we can break the monolithic scheduler up into modular ones. These are simpler to maintain and can have independent code bases and even development teams. But breaking up the scheduler means we have to somehow coordinate resource assignment, and decide who gets what! But: How do we arbitrate resources between schedulers? EuroSys 2013
6
existing approaches monolithic scheduler SCHEDULER static partitioning
hard to diversify code growth scalability bottleneck static partitioning poor utilization inflexible S0 S1 S2 Now, there are various ways this can be done. The straightforward one is to simply break up the cluster too, and just have a scheduler for each subcluster. So, for example, the MapReduce scheduler runs on the MapReduce subcluster, while critical infrastructure services are scheduled and run in an infrastructure subcluster. That approach has some grave disadvantages, though: the partitioning is static, so one subcluster can be overloaded while others have spare capacity. 6 EuroSys 2013
7
existing approaches two-level shared-state RESOURCE MANAGER
e.g. UCB Mesos [NSDI 2011] hoarding information hiding S0 S1 S2 RESOURCE MANAGER two-level S0 S1 S2 CLUSTER STATE shared-state Clearly, a possible way to address this is by making the partitioning dynamic. This is what existing systems do, for example Mesos from Berkeley: they schedule resources on two levels, with a higher-level resource manager and a set of schedulers that each work within their resource allocations. We will later see that this has some problems, too. In Omega, we look at a different way of doing things: every scheduler can lay claim to any resource at any point in time. The cluster state data structure is effectively shared between all schedulers. There is no need for any reservation or a-priori coordination between schedulers -- they just try, hope for the best and sort out any problems afterwards! Let's see how this works in detail.... 7 EuroSys 2013
8
how does omega work? S0 S1 8 EuroSys 2013
Consider this example. Here, we have our shared cluster state in beige. This represents the ground truth, i.e. which tasks are actually assigned to which machines. We have two schedulers, shown in red and blue here. Each scheduler contains its own local replica of the cluster state, which is frequently updated to reflect changes. Now, let's see what happens if jobs arrive at the red and blue scheduler... 8 EuroSys 2013
9
S0 S1 how does omega work? 9 EuroSys 2013
In the red scheduler, the tasks come in, and are considered for scheduling. Using red's specific scheduling logic, it decides on two machines to place these tasks on. Then, it sends a delta to the shared cluster state, asking for its scheduling change to be committed. 9 EuroSys 2013
10
S0 S1 how does omega work? EuroSys 2013 EuroSys 2013
Meanwhile, the blue scheduler has been busy, too: it considers two tasks, finds machines to place them on using its specialized scheduling logic, and sends a delta to shared state. EuroSys 2013 EuroSys 2013
11
how does omega work? Conflict!
In the shared cluster state, the deltas are applied. However, in this case, both tried to place a task on the same machine! This leads to a conflict: two schedulers are trying to change the same machine concurrently, which could lead to over-commit of that machine. Conflict! 11 EuroSys 2013
12
how does omega work? S0 S1 failure! success! 12 EuroSys 2013
Only one delta can be applied successfully -- in this case, the blue one makes it, and blue's tasks start running. The red delta, however, fails to apply, and a conflict indication is returned to the red scheduler, who may try again after updating its state replica. Clearly, these conflicts lead to wasted work, and may be detrimental to scheduling performance. The Omega model's viability thus hinges on how often conflicts occur, and how well we can avoid them. 12 EuroSys 2013
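To make the commit/conflict/retry cycle on these slides concrete, here is a minimal sketch of a shared-state, optimistically concurrent scheduler loop. All names (CellState, the delta format, the most-free-machine placement logic) are illustrative assumptions, not Omega's actual code or API.

```python
# Minimal sketch of Omega-style shared-state scheduling with optimistic
# concurrency. Illustrative only: not Omega's real data structures or API.
import copy

class CellState:
    """Shared cluster state: free CPUs per machine plus a commit counter."""
    def __init__(self, free_cpus_per_machine):
        self.free = dict(free_cpus_per_machine)
        self.version = 0                      # bumped on every successful commit

    def snapshot(self):
        return copy.deepcopy(self)            # a scheduler's private replica

    def try_commit(self, delta, base_version):
        """Apply a scheduler's delta atomically, or report a conflict."""
        if base_version != self.version:      # someone else committed first
            return False
        if any(self.free[m] < cpus for m, cpus in delta):
            return False                      # would over-commit a machine
        for m, cpus in delta:
            self.free[m] -= cpus
        self.version += 1
        return True

def schedule_job(shared, task_cpus, max_retries=5):
    """One scheduler: place tasks on a local replica, then try to commit."""
    for _ in range(max_retries):
        local = shared.snapshot()
        delta = []
        for cpus in task_cpus:                # toy placement: most-free machine
            machine = max(local.free, key=local.free.get)
            local.free[machine] -= cpus
            delta.append((machine, cpus))
        if shared.try_commit(delta, local.version):
            return delta                      # success: the tasks start running
        # conflict: refresh the replica and retry (the wasted work discussed above)
    return None                               # give up after repeated conflicts

cell = CellState({"m1": 8, "m2": 8, "m3": 4})
print(schedule_job(cell, [2, 2]))             # e.g. [('m1', 2), ('m2', 2)]
```

The version check here is the coarsest possible conflict test (any concurrent commit aborts the whole transaction); the optimizations discussed later in the talk relax exactly this check.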
13
2) workload characterization 3) comparison of approaches
overview 1) intro & motivation 2) workload characterization 3) comparison of approaches 4) trace-based simulation 5) flexibility case study After this brief explanation of how the Omega scheduling architecture works, I'll now use a set of practical case studies to investigate how viable the Omega model is and how it compares to the alternatives. Before I go into details, we need to take a look at the workload setup we'll be using, though. 13 EuroSys 2013
14
workload: batch/service split
We'll be looking at a simple two-way workload split into batch and service jobs initially. This is a real differentiation that we're doing in Omega, and it is also a challenging one: batch jobs run for some time and finish, while service jobs are production jobs that require careful scheduling for resource optimality and run for a very long time. Let's have a look at how this breaks up the real-world cluster workload. 14 EuroSys 2013
15
workload: batch/service split
Cluster A Medium size Medium utilization Cluster B Large size Medium utilization Cluster C Medium (12k mach.) High utilization Public trace We had a look at three representative Google clusters, denoted A, B and C here. They differ a bit in size and utilization; importantly, cluster C is also the one for which Google released a public trace last year. We used the same time period as the public trace, so the workload is identical to the published one. The bar chart shows, for each cluster, the relative shares of batch jobs, which are in solid colour, and service jobs, which are the dotted portion. Jobs/tasks: counts CPU/RAM: resource seconds [i.e. resource × job runtime in sec.] 15 EuroSys 2013
16
workload: batch/service split
Cluster A Medium size Medium utilization Cluster B Large size Medium utilization Cluster C Medium size High utilization Public trace TAKEAWAY Most jobs are batch, but most resources are consumed by service jobs. There's a very clear take-away here: by job and task count, almost all jobs are batch, but the majority of resource-time is devoted to service jobs. Jobs/tasks: counts CPU/RAM: resource seconds [i.e. resource × job runtime in sec.] 16 EuroSys 2013
17
workload: batch/service split
                               Batch jobs    Service jobs
80th %ile runtime              12-20 min.    29 days
80th %ile inter-arrival time   4-7 sec.      2-15 min.
Individual batch and service jobs have different properties, too. Batch jobs are much shorter than service ones, and they arrive a lot more often. This is why we want to handle them differently. 17 EuroSys 2013
18
2) workload characterization 3) comparison of approaches
overview 1) intro & motivation 2) workload characterization 3) comparison of approaches 4) trace-based simulation 5) flexibility case study So, let's see how the Omega approach compares to the competition! 18 EuroSys 2013
19
methodology: simulation
simulation using empirical workload parameter distributions For this comparison, we wrote a simple simulator that can simulate monolithic schedulers, Mesos-style schedulers and Omega. It simplifies things a little: for example, it generates jobs based on empirical parameter distributions and uses a simple placement algorithm, but this lets us implement all of the architectures. We are making the source code of this simulator available for others to use. Code [soon to be] available: 19 EuroSys 2013
20
parameters Scheduler decision time
t_task: per-task (usually 0.005s per task); t_job: constant (usually 0.1s per job) One important parameter for our simulation is the scheduler decision time, that is, the time it takes the scheduling logic to make a decision for each job. We model the decision time using a constant per-job element and an element that is linear in the number of tasks, so that larger jobs take longer to schedule. TODO: we have up to thousands, remember! If not otherwise specified, we set t_job to 100ms and t_task to 5ms in all of our experiments. These values are extremely conservative worst-case bounds on how long Google's current scheduling logic runs for; in practice, it would usually be much faster, so we're making things deliberately difficult for Omega here. n: num. tasks 20 EuroSys 2013
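Spelled out, the decision-time model the slide describes is a per-job constant plus a term linear in the number of tasks n:

```latex
t_{\text{decision}} = t_{\text{job}} + n \cdot t_{\text{task}}
```

With the stated defaults, a 1,000-task job would take roughly 0.1 s + 1000 × 0.005 s ≈ 5.1 s of scheduler time.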
21
How does the shared-state design compare with other architectures?
Experiment 1: How does the shared-state design compare with other architectures? Experiment details: all clusters, 7 simulated days 2 schedulers varying Service scheduler So, let's see how we do! 21 EuroSys 2013
22
monolithic, uniform decision time (single logic)
red => unscheduled jobs remained; blue => all jobs were scheduled; scheduler busyness = time spent scheduling / total time. We start off with a monolithic scheduler that only has a single scheduling logic, which it applies to all jobs. We call this "single-path". On this diagram, we vary the per-job constant scheduling time and the per-task scheduling time on the X and Y axes, while the Z-axis shows the scheduler utilization. This demands a bit of explanation: scheduler utilization indicates what fraction of time the scheduler is busy making decisions. Once the value approaches 1, the scheduler is overwhelmed and can't keep up with the workload. This is what causes the red indication of unscheduled jobs here. In this case, we are varying the decision time for all jobs, as there is only a single scheduling logic. This is a useful baseline, but clearly not very realistic. tjob for ALL jobs ttask for ALL jobs 22 EuroSys 2013
23
monolithic, fast-path batch decision time
scheduler busyness head-of-line blocking So let's consider something more realistic: a "multi-path" monolithic scheduler, which has a fast-path for batch jobs. We now vary the decision times for service jobs only on the X and Y axes. Utilization drops by a lot, and we only see unscheduled jobs if we take 100s per job plus 1s for each task. That's clearly pushing it. Now, an ideal Omega implementation without any conflicts would produce a similar-looking graph. We'll see -- let's first have a look at a Mesos-style two-level scheduler. tjob for service jobs ttask for service jobs 23 EuroSys 2013
24
Oops... mesos v0.9 (of May 2012) scheduler busyness
Huh? <CLICK> That's not quite what we expected! The scheduler utilization is low, and yet there are unscheduled jobs in every case. We were a little confused by this result at first, but it actually makes sense. To understand, we need to look at how Mesos works. tjob for service jobs ttask for service jobs 24 EuroSys 2013
25
mesos 1. Green receives offer of all available resources.
2. Blue's task finishes. RESOURCE MANAGER 3. Blue receives tiny offer. 4. Blue cannot use it. [repeat many times] We have a Mesos resource manager, and three schedulers, which receive offers in turn. Let's see what happens. <CLICK> First, S2 receives an offer. This offer contains all available cluster resources, as Mesos guarantees dominant resource fairness and offers all resources in turn. <CLICK> Now, while S2 is making its scheduling decisions -- which may take a long time -- a task of S1 finishes. At this point, the resource manager will make a tiny offer of the newly available resources to S1. <CLICK> However, S1 cannot make use of this offer, as it is trying to schedule 3 tasks, and only one machine is available to it. So it does not schedule anything, and returns the offer. It may be offered again immediately after -- <CLICK> this cycle can iterate many times while S2 is still making its decision. <CLICK> Eventually, S2 finishes scheduling a job, and <CLICK> S1 now finally receives a suitably large offer. However, by this point it has already given up on the job it was trying to schedule, as it suffered too many retries, or too much time elapsed. 5. Green finishes scheduling. 6. Blue receives large offer. By now, it has given up. 25 EuroSys 2013
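A toy, self-contained caricature of this offer-hoarding pathology (the names and mechanics below are illustrative assumptions, not Mesos's actual API): a scheduler holding a 3-task job keeps declining single-machine offers until it gives up.

```python
# Toy caricature of the decline/re-offer loop described above.
# Illustrative only: not Mesos's actual API or offer semantics.
import random

def offer_stream():
    """Mostly tiny offers (one newly freed machine), occasionally a big one."""
    while True:
        yield 1 if random.random() < 0.95 else 5   # machines in this offer

def blue_scheduler(tasks_needed=3, patience=20):
    declines = 0
    for machines_offered in offer_stream():
        if machines_offered >= tasks_needed:
            return f"scheduled after {declines} declined offers"
        declines += 1                  # offer too small for a 3-task job
        if declines >= patience:
            return "gave up: too many tiny offers while the other scheduler held the rest"

print(blue_scheduler())
```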
26
omega, no optimizations
scheduler busyness So here's Omega. Remember: our main goal was additional flexibility. But it doesn't perform too badly, just not as well as we'd like. tjob for service jobs ttask for service jobs 26 EuroSys 2013
27
omega, optimized scheduler busyness tjob for service jobs
So we did a few optimizations, and things got much better. The utilization is low, indicating that the service scheduler can easily keep up with the workload. tjob for service jobs ttask for service jobs 27 EuroSys 2013
28
omega, optimized TAKEAWAY The Omega shared-state model performs as well as a (complex) monolithic multi-path scheduler. This is a nice result: despite having to do more work, checking for conflicts and re-trying if they occur, the scheduler utilization is not significantly worse than with a monolithic multi-path scheduler, as we can see by comparing the graphs at the bottom. Monolithic Mesos Omega 28 EuroSys 2013
29
Does the shared-state design scale to many schedulers?
Experiment 2: Does the shared-state design scale to many schedulers? Experiment details: cluster B, 7 simulated days 2 schedulers varying job arrival rate and number of schedulers Now, you might be thinking, well, two schedulers is not a very taxing setup, is it? We thought so, too, and decided to do an experiment where we vary the workload and the number of schedulers. 29 EuroSys 2013
30
scaling to many schedulers
In this experiment, we vary batch workload on the X-axis: "4x" means four times as many batch jobs arrive within the same time frame. We load-balance this batch workload across a variable number of batch schedulers using a simple hashing function. As there are more jobs, there is more scheduling work to be done. Consequently, the scheduler utilization goes up towards the right. However, adding more schedulers ameliorates this: load-balancing across schedulers reduces the average utilization. As there are more schedulers, there are also additional opportunities for conflicts to arise. The graceful scaling towards the right hand end suggests that this isn't happening, but we made sure to check... 30 EuroSys 2013
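The load balancing mentioned above is just a hash of each batch job across the available batch schedulers; a minimal sketch follows (the job-ID format and hash choice are assumptions, not the simulator's actual code).

```python
# Minimal sketch of hashing batch jobs across N batch schedulers.
# Illustrative only; the real simulator's hashing may differ.
import hashlib

def pick_batch_scheduler(job_id: str, num_batch_schedulers: int) -> int:
    digest = hashlib.sha1(job_id.encode()).hexdigest()
    return int(digest, 16) % num_batch_schedulers

print(pick_batch_scheduler("job-12345", 8))   # always the same scheduler for this job
```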
31
2) workload characterization 3) comparison of approaches
overview 1) intro & motivation 2) workload characterization 3) comparison of approaches 4) trace-based simulation 5) flexibility case study This is all good news for Omega, but we were wondering if this really works at Google scale, and with all the complexities inherent to real production systems. We thus also ran a high-fidelity simulation with real workload traces. 31 EuroSys 2013
32
lightweight simulator high-fidelity simulator
simulator comparison
                        lightweight simulator     high-fidelity simulator
machines                homogeneous               real-world
job parameters          empirical distribution    workload trace
constraints             not supported             supported
scheduling algorithm    random first fit          Google algorithm
runtime                 fast (24h ≃ 5 min)        slow (24h ≃ 2 h)
The high-fidelity simulator is much more complex than the simple simulator, and in addition to replaying real workloads, it supports heterogeneous machines, a complex resource model, constraints and uses the real Google production scheduling algorithm. As a result, it also takes much longer to run. 32 EuroSys 2013
33
How much scheduler interference do we see with real Google workloads?
Experiment 3: How much scheduler interference do we see with real Google workloads? Experiment details: cluster C, 29 days 2 schedulers, non-uniform decision time varying Service scheduler Let's see how things work out with a month-long trace of cluster C. 33 EuroSys 2013
34
scheduler busyness 34 EuroSys 2013
overhead due to conflicts scheduler busyness Here, we vary the per-job decision time for service jobs on the X-axis, and again show scheduler utilization on the Y-axis. The batch scheduler's utilization is shown in blue, while the service scheduler's utilization is in red. We also show a dotted line that approximates what the ideal curve would look like if there were no conflicts. <CLICK> As we can see, there is a huge overhead due to conflicts once we go beyond, say 30s of decision time per service job. Let's see exactly how much. 34 EuroSys 2013
35
Interference is higher for real-world settings.
scheduler busyness TAKEAWAY Interference is higher for real-world settings. overhead due to conflicts Here, we vary the per-job decision time for service jobs on the X-axis, and again show scheduler utilization on the Y-axis. The batch scheduler's utilization is shown in blue, while the service scheduler's utilization is in red. We also show a dotted line that approximates what the ideal curve would look like if there were no conflicts. <CLICK> As we can see, there is a huge overhead due to conflicts once we go beyond, say 30s of decision time per service job. Let's see exactly how much. 35 EuroSys 2013
36
optimizations 1. Fine-grained conflict detection
2. Incremental commits So, all for nothing then? Not quite. We looked at this, and thought about some optimizations that we could make to the Omega approach in order to reduce the number of conflicts. <CLICK> First, we moved from a straightforward conflict detection mechanism based on sequence numbers that would increment <CLICK> on task placement to <CLICK> a fine-grained notion of conflicts, checking each time whether deltas feasibly commute, <CLICK> rather than just checking if the machine was touched. This should reduce the number of spurious conflicts detected. <CLICK> Second, we stopped gang-scheduling and added support for incremental commits: <CLICK> when only some tasks in a large job experience a conflict, <CLICK> the others may start running, and <CLICK> only the conflicted ones are re-tried. This should reduce the size of the retry commit, and thus make it more likely to succeed. 1st 2nd 36 EuroSys 2013
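As a rough illustration of the two optimizations (structure and names are mine, not Omega's implementation): the fine-grained check only rejects a placement if its target machine can no longer hold it, and the incremental commit applies whatever fits while handing back only the conflicted tasks for retry.

```python
# Hedged sketch of fine-grained conflict detection plus incremental commit.
# Illustrative only: not Omega's actual code.

def commit_incremental(free_cpus, delta):
    """free_cpus: dict machine -> free CPUs in the shared cell state.
    delta: list of (machine, cpus) placements proposed by one scheduler.

    Unlike the all-or-nothing version sketched earlier, we accept every
    placement that still fits on its machine and return the rest so the
    scheduler retries only those tasks (no gang scheduling).
    """
    accepted, conflicted = [], []
    for machine, cpus in delta:
        if free_cpus.get(machine, 0) >= cpus:   # does this placement still fit?
            free_cpus[machine] -= cpus
            accepted.append((machine, cpus))
        else:
            conflicted.append((machine, cpus))  # only these need a retry
    return accepted, conflicted

state = {"m1": 2, "m2": 4}
print(commit_incremental(state, [("m1", 2), ("m1", 2), ("m2", 1)]))
# -> ([('m1', 2), ('m2', 1)], [('m1', 2)])
```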
37
How do the optimizations affect performance?
Experiment 4: How do the optimizations affect performance? Experiment details: cluster C, 29 days 2 schedulers, non-uniform decision time varying Service scheduler How much impact do these changes have? 37 EuroSys 2013
38
impact on scheduler utilization
In this experiment, we compare the effect of the two optimizations on scheduler busyness. Fine-grained conflict detection, shown in cyan, approximately halves the number of conflicts seen relative to the baseline in red. If we add incremental commit on top of this, the scheduler busyness is halved again, as shown by the purple line. At this point, the average number of conflicts per job is well below 1, even for decision times over 100s. As we would expect, the benefits from a reduced number of conflicts translate into lower scheduler utilization across the board. 38 EuroSys 2013
39
practical implications – scheduler utilization
scheduler busyness overhead due to conflicts What does this look like on our previous graph, which compared the observed scheduler utilization to the ideal no-conflict case? Here it is. As we can see, the overhead due to conflicts is now relatively minor, and scales much more pleasantly than when it skyrocketed before. 39 EuroSys 2013
40
practical implications – scheduler utilization
TAKEAWAY We can make simple improvements that significantly improve scalability. overhead due to conflicts What does this look like on our previous graph, which compared the observed scheduler utilization to the ideal no-conflict case? Here it is. As we can see, the overhead due to conflicts is now relatively minor, and scales much more pleasantly than when it skyrocketed before. 40 EuroSys 2013
41
MapReduce scheduler with opportunistic extra resources
Case study MapReduce scheduler with opportunistic extra resources Finally, we would also like to know if there are any qualitative benefits of moving to Omega's shared-state architecture for independent schedulers. To find out, we did a simple case study, in which we looked at a MapReduce scheduler that can introspect cluster state in order to opportunistically harness idle resources. 41 EuroSys 2013
42
Count of jobs with X workers
workers in MR jobs [Histogram: count of jobs with X workers vs. number of workers (log10), snapshot over 29 days] The motivation for this idea is that the number of workers per MapReduce job is a manually configured value at Google.
43
case study: a MapReduce scheduler
60% of MapReduces 3-4x speedup! Fraction of MR jobs with speedup < X Very simple linear speedup model; speedup on the x-axis, CDF on the y-axis. Green = happy zone. 60% of MRs benefit, with a 3-4x speedup at the 50th percentile. The tail is maybe less realistic: >100x speedups are probably the result of the naive linear model. Relative speedup [log10] better cluster C, 29 days 43 EuroSys 2013
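For context, here is one plausible reading of the "naive linear model" mentioned in the notes; the exact speedup model is not shown on the slide, so treat this purely as an assumption for illustration.

```python
# Hedged sketch of a naive linear speedup model (an assumption, not
# necessarily the model used in the case study).

def naive_linear_speedup(configured_workers, opportunistic_workers):
    """Assume runtime shrinks in exact proportion to added workers."""
    total = configured_workers + opportunistic_workers
    return total / configured_workers

print(naive_linear_speedup(100, 300))   # 4.0x if a 100-worker job grabs 300 idle workers
```

Under this assumption, enough idle capacity trivially yields >100x speedups, which is exactly the caveat in the notes above.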
44
case study: a MapReduce scheduler
TAKEAWAY The Omega approach gives us the flexibility to easily support custom policies. 60% of MapReduces Fraction of MR jobs with speedup < X 3-4x speedup! Relative speedup [log10] better cluster C, 29 days 44 EuroSys 2013
45
Flexibility and scale require parallelism,
conclusion TAKEAWAYS Flexibility and scale require parallelism, parallel scheduling works if you do it right, and using shared state is the way to do it right! So, let's finish up -- in this work, we have shown that, for real Google workloads, our needs for flexibility and scale require parallelism in the cluster scheduler. Parallel scheduling with full state exposure to all schedulers is tricky, but works if you do it right. The right way to do it is to use shared state and optimistic concurrency! Thank you very much! 45 EuroSys 2013
46
BACKUP SLIDES EuroSys 2013
47
Why might scheduling take 60 seconds?
scheduling policies Why might scheduling take 60 seconds? Large jobs (1,000s of tasks) Optimization algorithms (constraint solving, bin packing) Very picky jobs in a full cluster (preemption consequences) Monte Carlo simulations (fault tolerance) 48 EuroSys 2013
48
methodology: simulation
[Diagram: the experiment configuration (a workload of Batch, Service and MapReduce jobs drawn from empirical distributions, plus an initial cluster state) drives an event-driven simulator that maintains the cluster state.] For this comparison, we wrote a simple simulator that can simulate monolithic schedulers, Mesos-style schedulers and Omega. It simplifies things a little: for example, it generates jobs based on empirical parameter distributions and uses a simple placement algorithm, but this lets us implement all of the architectures. We are making the source code of this simulator available for others to use. Code [soon to be] available: 19 EuroSys 2013
49
workload: job runtime distributions
Batch Service Fraction of jobs running for less than X Let's drill into this a little more, and consider the job runtime distributions for the two types. This is a cumulative distribution, so this point indicates that about 25% of service jobs in cluster C ran for less than an hour, while over 90% of batch jobs finished within an hour. The service lines do not reach 1.0 at the right hand edge because some jobs ran for more than 29 days. EuroSys 2013
50
workload: job runtime distributions
Batch Service TAKEAWAY Service jobs, once scheduled, run for much longer than batch jobs do. Fraction of jobs running for less than X Again, the take-away is clear: service jobs run for much longer than batch jobs do, so we need to make good scheduling decisions for them! EuroSys 2013
51
workload: inter-arrival time distributions
Service Batch Fraction of inter-arrival gaps less than X Next, we look at the inter-arrival time for the two job types. This is the time that elapses between two neighbouring jobs of this type arriving. As we can see, the batch inter-arrival time is always much less than a minute, while service jobs arrive fairly infrequently. EuroSys 2013
52
workload: inter-arrival time distributions
Service Batch TAKEAWAY Service jobs arrive much less frequently than batch jobs do. Fraction of inter-arrival gaps less than X This leads to the obvious take-away: service job turnover is much lower than batch turnover. This gives us sufficient time to make complex scheduling decisions for service jobs. EuroSys 2013
53
the omega approach Shared state
Deltas against shared state Easy to develop & maintain Heterogeneous scheduling logic supported CLUSTER STATE Optimistic concurrency No explicit coordination required Post-hoc interference resolution Scales well EuroSys 2013
54
conflict fraction
conflict fraction = num. conflicts / total num. transactions. This graph shows, on the same X-axis, the average number of conflicts per job before it successfully schedules. Up to about 30s of per-job decision time, we have less than a single conflict per job, on average. Beyond 30s, however, the number of conflicts skyrockets, and we already see about 7 conflicts for each successful service job scheduling event at 100s. EuroSys 2013
55
impact on conflict fraction
In this experiment, we compare the average number of conflicts per successfully scheduled job. Fine grained conflict detection, shown in purple, approximately halves the number of conflicts seen. If we add incremental commit on top of this, the per-job conflicts are halved again, and the average stays below 1 conflict per job, even for decision times over 100s. EuroSys 2013
56
case study: a MapReduce scheduler
50% of MapReduces Fraction of MR jobs with speedup < X 4.5x speedup Relative speedup [log10] cluster A, 29 days EuroSys 2013
57
caveats, or when this won't work well
Possible problems: aggressive, systematically adverse workloads or schedulers; small clusters with high overcommit. Now, while our experiments are generally very promising, we do believe that there are situations in which the Omega model will run into trouble. Deal with these using out-of-band or post-facto enforcement mechanisms. EuroSys 2013