Large-scale cluster management at Google with Borg Google Inc.

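Agenda ◦ Goals of Borg (1, 2.1) ◦ How users describe the computing resource requirements of jobs and tasks (2.3, 2.5) ◦ Units of resource allocation (2.2, 2.4) ◦ Scheduling algorithm and resource allocation (3.2, 6.2) ◦ Evaluation methodology (5.1) ◦ Experimental results (5.2, 5.4, 5.5) ◦ Lessons learned (8)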

Borg A cluster manager. ◦ Runs hundreds of thousands of jobs from many thousands of different applications. ◦ Spans a number of clusters, each with up to tens of thousands of machines. ◦ Offers very high reliability and availability.

Workloads Heterogeneous workload with two main parts. ◦ Long-running services ▪ Handle short-lived, latency-sensitive requests. ▪ High priority (prod). ◦ Batch jobs ▪ Take from a few seconds to a few days to complete. ▪ Low priority (non-prod).

Jobs and Tasks Job ◦ Runs in one Borg cell. ◦ Consists of many tasks. ◦ Has properties and constraints. ▪ Name, owner, number of tasks, priority. Task ◦ Maps to a set of Linux processes running in a container on a machine. ◦ Has properties and constraints. ▪ Resource requirements (CPU cores, RAM, disk space, disk access rate, TCP ports, etc.).
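The job and task properties listed on this slide can be pictured as a small data structure. The following is a minimal sketch only: the class names `TaskSpec` and `JobSpec`, the field names, and the default cell name are hypothetical and are not Borg's actual configuration language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """Per-task resource requirements (hypothetical field names)."""
    cpu_cores: float
    ram_bytes: int
    disk_bytes: int
    tcp_ports: int = 0


@dataclass
class JobSpec:
    """Job-level properties; every task in the job shares one TaskSpec here."""
    name: str
    owner: str
    priority: int
    task_count: int
    task: TaskSpec
    cell: str = "example-cell"  # a job runs in exactly one cell (name is made up)

    def total_request(self) -> TaskSpec:
        """Aggregate resources requested by all tasks of the job."""
        n = self.task_count
        return TaskSpec(
            cpu_cores=self.task.cpu_cores * n,
            ram_bytes=self.task.ram_bytes * n,
            disk_bytes=self.task.disk_bytes * n,
            tcp_ports=self.task.tcp_ports * n,
        )


job = JobSpec(
    name="web",
    owner="alice",
    priority=200,
    task_count=10,
    task=TaskSpec(cpu_cores=0.5, ram_bytes=2**30, disk_bytes=10 * 2**30, tcp_ports=1),
)
print(job.total_request())
```

Treating every task as identical keeps the sketch short; per-task overrides are left out.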

Jobs and Tasks (Cont.)

Non-overlapping priority bands ◦ Monitoring, production, batch, and best effort. ◦ Tasks from higher-priority jobs can preempt tasks from lower-priority ones. ◦ Tasks in the production priority band are not allowed to preempt one another.
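The preemption rule on this slide can be sketched as a small predicate. The numeric ranges assigned to each band below are invented for illustration; the slides give only the band names and their order.

```python
# Hypothetical, non-overlapping priority ranges; the real values are not in the slides.
BANDS = {
    "best_effort": range(0, 100),
    "batch": range(100, 200),
    "production": range(200, 300),
    "monitoring": range(300, 400),
}


def band_of(priority: int) -> str:
    """Return the name of the band that contains the given priority."""
    for name, prio_range in BANDS.items():
        if priority in prio_range:
            return name
    raise ValueError(f"priority {priority} is outside every band")


def can_preempt(new_priority: int, victim_priority: int) -> bool:
    """Higher priority may preempt lower, except production against production."""
    if new_priority <= victim_priority:
        return False
    if band_of(new_priority) == "production" and band_of(victim_priority) == "production":
        return False
    return True


print(can_preempt(250, 150))  # production task vs. batch task
print(can_preempt(260, 210))  # production task vs. production task
```

The production-band exception exists to limit preemption cascades, where one eviction triggers a chain of further evictions.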

Jobs and Tasks (Cont.) Jobs with insufficient quota are immediately rejected upon submission. ◦ Quota: a vector of resource quantities. ▪ (CPU, RAM, disk space, etc.) ◦ Higher-priority quota costs more.
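Quota checking at submission time amounts to an element-wise comparison of two resource vectors. The sketch below assumes quotas and requests are plain dictionaries keyed by resource name; the function name `admit` and the resource keys are hypothetical.

```python
def admit(request: dict[str, float], remaining_quota: dict[str, float]) -> bool:
    """Accept a job only if every requested quantity fits in the remaining quota.

    A resource that is missing from the quota vector counts as zero quota.
    """
    return all(
        amount <= remaining_quota.get(resource, 0.0)
        for resource, amount in request.items()
    )


quota = {"cpu": 100.0, "ram_gib": 400.0, "disk_gib": 2000.0}
print(admit({"cpu": 40.0, "ram_gib": 128.0}, quota))   # fits in every dimension
print(admit({"cpu": 40.0, "ram_gib": 512.0}, quota))   # RAM exceeds the quota
```

Because the check is per dimension, a job is rejected as soon as any single resource exceeds its quota, even if all the others fit.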

Architecture (Cont.) Cell ◦ A set of heterogeneous machines, all in one cluster, that Borg manages as a unit and runs jobs on. ◦ Median cell size: about 10k machines. Alloc ◦ A reserved set of resources on a machine in which one or more tasks can run.

Scheduler The scheduling algorithm consists of two parts. ◦ Feasibility checking: find the machines on which the task could run. ◦ Scoring: pick one of the feasible machines. ▪ Spreading load vs. best fit. ▪ Borg uses a hybrid method to reduce the amount of stranded resources: ones that cannot be used because another resource on the machine is fully allocated.
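The two-phase structure (feasibility checking, then scoring) can be sketched as follows. The scoring function here is an illustrative stand-in, not Borg's actual hybrid scorer: it prefers the machine whose leftover CPU and RAM fractions stay most balanced after placement, which is one simple way to penalise stranded resources. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Machine:
    name: str
    cpu_total: float
    ram_total: float
    cpu_free: float
    ram_free: float


def feasible(machines: list[Machine], cpu: float, ram: float) -> list[Machine]:
    """Phase 1: keep only machines with enough free CPU and RAM."""
    return [m for m in machines if m.cpu_free >= cpu and m.ram_free >= ram]


def stranding_score(m: Machine, cpu: float, ram: float) -> float:
    """Phase 2 (stand-in): imbalance of leftover fractions; lower is better."""
    cpu_left = (m.cpu_free - cpu) / m.cpu_total
    ram_left = (m.ram_free - ram) / m.ram_total
    return abs(cpu_left - ram_left)


def pick_machine(machines: list[Machine], cpu: float, ram: float) -> Optional[Machine]:
    """Return the best-scoring feasible machine, or None if the task stays pending."""
    candidates = feasible(machines, cpu, ram)
    if not candidates:
        return None
    return min(candidates, key=lambda m: stranding_score(m, cpu, ram))


cell = [
    Machine("a", 16, 64, cpu_free=8, ram_free=8),    # too little RAM
    Machine("b", 16, 64, cpu_free=8, ram_free=32),   # balanced after placement
    Machine("c", 16, 64, cpu_free=1, ram_free=60),   # too little CPU
    Machine("d", 16, 64, cpu_free=16, ram_free=20),  # would strand CPU
]
chosen = pick_machine(cell, cpu=4, ram=16)
print(chosen.name if chosen else "pending")
```

Pure load spreading would favour the emptiest machine and pure best fit the fullest one; the balance heuristic sits between the two in spirit, without claiming to reproduce Borg's scoring.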

Performance Isolation Helps with overload and over-commitment. Latency-sensitive (LS) tasks vs. the rest (batch). ◦ LS tasks can temporarily starve batch tasks for several seconds. Compressible vs. non-compressible resources. ◦ When a machine runs out of a non-compressible resource (e.g. memory, disk space), the Borglet terminates tasks, from lowest to highest priority. ◦ When a machine runs out of a compressible resource (e.g. CPU cycles, disk I/O bandwidth), the Borglet throttles usage, favoring LS tasks, without killing any task.
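The two reactions to resource shortage can be sketched as two small functions. This is a simplification under stated assumptions: it works on current usage rather than reservations, it scales all batch tasks by the same factor, and every name in it is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningTask:
    name: str
    priority: int
    latency_sensitive: bool
    usage: float  # current usage of the resource under pressure


def terminate_for_non_compressible(tasks: list[RunningTask], capacity: float) -> list[str]:
    """Kill tasks from lowest to highest priority until the rest fit in capacity."""
    survivors = sorted(tasks, key=lambda t: t.priority, reverse=True)
    killed = []
    while survivors and sum(t.usage for t in survivors) > capacity:
        killed.append(survivors.pop().name)  # last element has the lowest priority
    return killed


def throttle_compressible(tasks: list[RunningTask], capacity: float) -> dict[str, float]:
    """Leave LS tasks untouched; scale batch tasks down to whatever is left."""
    ls_usage = sum(t.usage for t in tasks if t.latency_sensitive)
    batch = [t for t in tasks if not t.latency_sensitive]
    batch_usage = sum(t.usage for t in batch)
    left = max(capacity - ls_usage, 0.0)  # zero means batch is temporarily starved
    scale = min(1.0, left / batch_usage) if batch_usage > 0 else 1.0
    return {t.name: t.usage * scale for t in batch}


tasks = [
    RunningTask("web", priority=250, latency_sensitive=True, usage=6.0),
    RunningTask("batch1", priority=100, latency_sensitive=False, usage=4.0),
    RunningTask("batch2", priority=110, latency_sensitive=False, usage=4.0),
]
print(terminate_for_non_compressible(tasks, capacity=10.0))
print(throttle_compressible(tasks, capacity=10.0))
```

The asymmetry is the point: a compressible resource can be reclaimed by lowering a task's quality of service, while a non-compressible one generally cannot be reclaimed without killing the task.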

Combined vs Segregated

Lessons Learned The bad: ◦ Jobs are restrictive as the only grouping mechanism for tasks. ◦ One IP address per machine complicates things. ◦ Optimizing for power users at the expense of casual ones.

Lessons Learned (Cont.) The good: ◦ Allocs are useful. ◦ Cluster management is more than task management. ◦ Introspection is vital. ◦ The master is the kernel of a distributed system.

Conclusion Virtually all of Google’s cluster workloads have switched to Borg over the past decade. Google continues to evolve it and has applied the lessons learned from it to Kubernetes.

Architecture

Borgmaster Consists of the main Borgmaster process and a separate scheduler. Borgmaster process ◦ Handles client RPCs, manages state machines for all the objects in the system, communicates with the Borglets, etc. ◦ Replicated five times; each replica maintains an in-memory copy of the state of the cell.

Borglet A local Borg agent on every machine. ◦ Starts and stops tasks, and restarts them if they fail. ◦ Manages local resources by manipulating OS kernel settings. ◦ Reports machine state to the Borgmaster and other monitoring systems.
