1 Large-scale cluster management at Google with Borg
Google Inc.

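2 Agenda
◦ Goals of Borg (1, 2.1)
◦ How users describe the computing-resource requirements of jobs and tasks (2.3, 2.5)
◦ Units of resource allocation (2.2, 2.4)
◦ Scheduling algorithm and resource allocation (3.2, 6.2)
◦ Evaluation methodology (5.1)
◦ Experimental results (5.2, 5.4, 5.5)
◦ Lessons learned (8)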

4 Borg
A cluster manager.
◦ Runs hundreds of thousands of jobs from many thousands of different applications.
◦ Runs across a number of clusters, each with up to tens of thousands of machines.
Provides very high reliability and availability.

5 Workloads
Heterogeneous workload with two main parts.
◦ Long-running services
  ▪ Handle short-lived, latency-sensitive requests.
  ▪ High priority (prod).
◦ Batch jobs
  ▪ Take a few seconds to a few days to complete.
  ▪ Low priority (non-prod).

7 Jobs and Tasks
Job
◦ Runs in one Borg cell.
◦ Consists of one or more tasks.
◦ Has properties and constraints.
  ▪ Name, owner, number of tasks, priority.
Task
◦ Maps to a set of Linux processes running in a container on a machine.
◦ Has properties and constraints.
  ▪ Resource requirements (CPU cores, RAM, disk space, disk access rate, TCP ports, etc.).
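The job and task properties listed on this slide can be pictured as a small data model. The sketch below is illustrative only: the class names, field names, units, and example values are assumptions made for this example and are not Borg's actual configuration schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Per-task resource requirements (hypothetical field names and units)."""
    cpu_cores: float
    ram_bytes: int
    disk_bytes: int
    tcp_ports: int = 0


@dataclass
class Job:
    """Job-level properties named on the slide, plus the cell the job runs in."""
    name: str
    owner: str
    priority: int
    task_count: int
    per_task: Resources
    cell: str  # a job runs in exactly one cell

    def task_ids(self):
        """Expand the job into (job name, task index) identifiers."""
        return [(self.name, index) for index in range(self.task_count)]


# Hypothetical example values.
job = Job(
    name="web-frontend",
    owner="alice",
    priority=200,
    task_count=3,
    per_task=Resources(cpu_cores=0.5, ram_bytes=2 << 30, disk_bytes=10 << 30),
    cell="cell-a",
)
print(job.task_ids())
```

Keeping the resource requirements in their own immutable value makes it easy to compare a task's request against a machine's free capacity or a user's quota.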

8 Jobs and Tasks(Cont.)

9 Non-overlapping priority bands
◦ In decreasing priority order: monitoring, production, batch, and best effort.
◦ Tasks from higher-priority jobs can preempt lower-priority ones.
◦ Tasks in the production band are not allowed to preempt one another.
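The preemption rules above can be sketched as a single predicate. This is a minimal illustration, not Borg's implementation: the `Band` values, the `TaskPriority` type, and the numeric `level` within a band are hypothetical names introduced here, and the sketch assumes that preemption inside the non-production bands follows the numeric level.

```python
from dataclasses import dataclass
from enum import IntEnum


class Band(IntEnum):
    """Priority bands from the slide; a larger value means a higher band."""
    BEST_EFFORT = 0
    BATCH = 1
    PRODUCTION = 2
    MONITORING = 3


@dataclass(frozen=True)
class TaskPriority:
    band: Band
    level: int  # hypothetical numeric priority inside the band; higher wins


def can_preempt(incoming: TaskPriority, running: TaskPriority) -> bool:
    """Return True if the incoming task may evict the running one."""
    # Tasks in the production band never preempt one another.
    if incoming.band == Band.PRODUCTION and running.band == Band.PRODUCTION:
        return False
    # Bands do not overlap, so comparing (band, level) orders all priorities.
    return (incoming.band, incoming.level) > (running.band, running.level)


print(can_preempt(TaskPriority(Band.PRODUCTION, 5), TaskPriority(Band.BATCH, 9)))
print(can_preempt(TaskPriority(Band.PRODUCTION, 5), TaskPriority(Band.PRODUCTION, 1)))
```

Special-casing the production band first means a high-level production task cannot evict a lower-level production task, which is the restriction the slide describes.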

10 Jobs and Tasks(Cont.)
Jobs with insufficient quota are rejected immediately upon submission.
◦ Quota: a vector of resource quantities (CPU, RAM, disk space, etc.) at a given priority.
◦ Higher-priority quota costs more than lower-priority quota.
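Because quota is a vector, the admission check amounts to a per-resource comparison. The following sketch assumes quota and requests are plain dictionaries keyed by resource name; the function names, resource keys, and return strings are hypothetical and chosen only for illustration.

```python
def fits_quota(request: dict, quota_left: dict) -> bool:
    """A request fits only if every resource dimension fits."""
    return all(amount <= quota_left.get(resource, 0)
               for resource, amount in request.items())


def submit(job_name: str, request: dict, quota_left: dict) -> str:
    """Reject at submission time if quota is insufficient, else charge it."""
    if not fits_quota(request, quota_left):
        return f"{job_name}: rejected (insufficient quota)"
    for resource, amount in request.items():
        quota_left[resource] -= amount
    return f"{job_name}: admitted"


# Hypothetical quota for one user at one priority.
quota = {"cpu": 10, "ram_gib": 64}
print(submit("job-a", {"cpu": 8, "ram_gib": 32}, quota))
print(submit("job-b", {"cpu": 4, "ram_gib": 8}, quota))
```

The second job is rejected even though it has enough RAM quota, because a single exhausted dimension (CPU here) is enough to fail the vector comparison.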

12 Architecture(Cont.)
Cell
◦ A set of heterogeneous machines in a cluster that run jobs.
◦ Median cell size: about 10k machines.
Alloc
◦ A reserved set of resources on a machine in which one or more tasks can run.

14 Scheduler
The scheduling algorithm consists of two parts.
◦ Feasibility checking: finds machines on which the task could run.
◦ Scoring: picks one of the feasible machines.
  ▪ Spreading load vs. best fit.
  ▪ Uses a hybrid method to reduce the amount of stranded resources: ones that cannot be used because another resource on the machine is fully allocated.
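The two-phase structure can be sketched in a few lines. The feasibility step follows the slide directly; the scoring function is only an assumption made for illustration (it penalizes placements that leave one resource nearly exhausted while another stays largely free) and is not the scoring model Borg uses. All names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    capacity: dict  # resource name -> total amount
    free: dict      # resource name -> unallocated amount


def feasible_machines(machines, request):
    """Phase 1, feasibility checking: keep machines with room in every dimension."""
    return [m for m in machines
            if all(m.free.get(r, 0) >= amount for r, amount in request.items())]


def stranding_score(machine, request):
    """Hypothetical score: gap between the most and least free resource
    (as fractions of capacity) after placement. Lower means less stranding."""
    leftover = [(machine.free[r] - request.get(r, 0)) / machine.capacity[r]
                for r in machine.capacity]
    return max(leftover) - min(leftover)


def schedule(machines, request):
    """Phase 2, scoring: pick the feasible machine with the lowest score."""
    candidates = feasible_machines(machines, request)
    if not candidates:
        return None  # the task stays pending
    return min(candidates, key=lambda m: stranding_score(m, request))


capacity = {"cpu": 16, "ram_gib": 64}
machines = [
    Machine("A", capacity, {"cpu": 8, "ram_gib": 8}),
    Machine("B", capacity, {"cpu": 4, "ram_gib": 32}),
    Machine("C", capacity, {"cpu": 1, "ram_gib": 60}),
]
chosen = schedule(machines, {"cpu": 2, "ram_gib": 8})
print(chosen.name if chosen else "pending")
```

In this example machine C fails the feasibility check (too little CPU), and placing the task on A would use all of A's remaining RAM while stranding 6 CPU cores, so the sketch prefers B.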

15 Performance Isolation
Helps with overload and overcommitment.
Latency-sensitive (LS) tasks vs. the rest (batch).
◦ LS tasks can temporarily starve batch tasks for several seconds at a time.
Compressible vs. non-compressible resources.
◦ When a machine runs out of non-compressible resources, the Borglet terminates tasks, starting with the lowest priority.
◦ When a machine runs out of compressible resources, the Borglet throttles usage (favoring LS tasks).
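The difference between the two resource classes can be sketched as two reactions to overload. In the Borg paper, memory and disk space are examples of non-compressible resources, and CPU cycles and disk I/O bandwidth are examples of compressible ones. The code below is a simplified illustration under assumptions: the `Task` fields, the proportional split among batch tasks, and the example numbers are hypothetical and do not describe how the Borglet is implemented.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int
    latency_sensitive: bool
    ram_gib: float  # non-compressible demand
    cpu: float      # compressible demand


def relieve_memory_pressure(tasks, ram_capacity):
    """Non-compressible case: terminate tasks from lowest to highest
    priority until the remaining usage fits on the machine."""
    survivors = sorted(tasks, key=lambda t: t.priority, reverse=True)
    killed = []
    while survivors and sum(t.ram_gib for t in survivors) > ram_capacity:
        killed.append(survivors.pop())  # lowest priority sits at the end
    return survivors, killed


def throttle_cpu(tasks, cpu_capacity):
    """Compressible case: nobody is killed. LS tasks are served first and
    batch tasks share what is left (proportional split is an assumption)."""
    grants = {}
    remaining = cpu_capacity
    for t in tasks:
        if t.latency_sensitive:
            grants[t.name] = min(t.cpu, remaining)
            remaining -= grants[t.name]
    batch = [t for t in tasks if not t.latency_sensitive]
    demand = sum(t.cpu for t in batch)
    scale = min(1.0, remaining / demand) if demand > 0 else 0.0
    for t in batch:
        grants[t.name] = t.cpu * scale
    return grants


tasks = [
    Task("serving", priority=300, latency_sensitive=True, ram_gib=6, cpu=5),
    Task("batch-a", priority=100, latency_sensitive=False, ram_gib=4, cpu=4),
    Task("batch-b", priority=50, latency_sensitive=False, ram_gib=3, cpu=2),
]
survivors, killed = relieve_memory_pressure(tasks, ram_capacity=10)
print([t.name for t in killed])
print(throttle_cpu(tasks, cpu_capacity=8))
```

With these numbers the memory shortage costs the lowest-priority batch task its slot, while the CPU shortage only slows the batch tasks down and leaves the LS task at its full demand.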

17 Combined vs Segregated

19 Lessons Learned
The bad:
◦ Jobs are restrictive as the only grouping mechanism for tasks.
◦ One IP address per machine complicates things.
◦ Optimizing for power users at the expense of casual ones.

20 Lessons Learned(Cont.)
The good:
◦ Allocs are useful.
◦ Cluster management is more than task management.
◦ Introspection is vital.
◦ The master is the kernel of a distributed system.

21 Conclusion
Virtually all of Google's cluster workloads have switched to Borg over the past decade. Google continues to evolve it and has applied the lessons learned from it to Kubernetes.

22 Architecture

23 Borgmaster
Consists of the main Borgmaster process and a separate scheduler.
Borgmaster process
◦ Handles client RPCs, manages state machines for all the objects in the system, communicates with the Borglets, etc.
◦ Replicated five times; each replica maintains an in-memory copy of the state of the cell.

24 Borglet
A local Borg agent.
◦ Starts and stops tasks, and restarts them if they fail.
◦ Manages local resources by manipulating OS kernel settings.
◦ Reports the state of the machine to the Borgmaster and other monitoring systems.

