Large-scale cluster management at Google with Borg

Name: Large-scale cluster management at Google with Borg
Uploaded: 2017-12-24T20:04:20+00:00
Duration: PT9M30S
Channel: Sharlene Tyler
Description: Large-scale cluster management at Google with Borg

Large-scale cluster management at Google with Borg
Paper by: A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E. Tune, J. Wilkes
Katie Wcisel, CS345 - Fall 2016

What is Borg and who are its users?
Borg is a highly scalable, reliable cluster management system at Google that admits, schedules, starts, restarts, and monitors the full range of applications that Google runs.
The users of Borg are Google developers and system administrators who run Google’s applications and services.

Architecture of Borg: Overview
A single Borg cell consists of the machines in a single cluster.
Each cell is independent of the others.
The Borgmaster is the controller of the cell.
Each machine has its own Borglet, which communicates with the Borgmaster.
There is also a Fauxmaster.

Architecture of Borg: The Borgmaster
The Borgmaster consists of two processes:
  The main Borgmaster process, which handles client remote procedure calls (RPCs) that either mutate state (e.g., create a job) or provide read-only access (e.g., look up a job), and which manages tasks, allocs, etc.
  A separate scheduler
Each Borg cell has five replicas of the Borgmaster, but only one is the elected master at any given time.

Architecture of Borg: The Borglets
Each machine has a Borglet, a local Borg agent that is responsible for:
  Starting, stopping, and restarting tasks
  Managing local resources
  Reporting the current state of the machine to the Borgmaster
If a Borglet becomes unreachable, the Borgmaster reschedules its assigned tasks on other machines.
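The rescheduling rule above can be sketched as follows. This is a minimal illustration, not Borg's implementation: the poll counter, the threshold `MISSED_POLLS_BEFORE_DOWN`, and the function name are all assumptions made for the example.

```python
MISSED_POLLS_BEFORE_DOWN = 3  # hypothetical threshold, not taken from the slides


def tasks_to_reschedule(missed_polls: dict, assignments: dict) -> list:
    """Return the tasks on machines whose Borglet is considered unreachable.

    missed_polls: machine name -> consecutive unanswered polls
    assignments:  machine name -> list of task names running there
    """
    down = [m for m, n in missed_polls.items() if n >= MISSED_POLLS_BEFORE_DOWN]
    return sorted(t for m in down for t in assignments.get(m, []))
```

Tasks on machines that still answer polls are left alone; only tasks from machines treated as down are handed back to the scheduler.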

Architecture of Borg: The Fauxmaster
A high-fidelity Borgmaster simulator used for:
  Debugging
  Capacity planning
  Sanity checks before making changes to a cell’s configuration

Borg is used for…
Long-running services that should never go down
  Gmail, Google Docs, web search, and internal infrastructure services like Bigtable
Batch jobs that take anywhere from a few seconds to a few days to complete
  These are typically less sensitive to short-term performance fluctuations
Within Borg, higher-priority jobs are referred to as prod, and everything else is referred to as non-prod.

Jobs vs Tasks
Work is submitted to Borg in the form of jobs.
Jobs consist of one or more tasks running in a single Borg cell.
  Job properties: name, owner, number of tasks
Tasks are the smallest separable parts of a job. Scheduling is done on tasks rather than on jobs.
  Task properties: resource requirements and the task’s index within the job
Users operate on jobs by issuing RPCs to Borg.
Users can change the properties of some or all tasks in a job by pushing a new job configuration to Borg, but some changes require tasks to be restarted or moved to a new machine.
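As a rough picture of the job/task relationship, the sketch below models a job as a named, owned list of indexed tasks. The class and field names, and the choice of CPU cores and GiB of RAM as the resource requirements, are assumptions for illustration and not Borg's actual configuration schema.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    index: int        # position of this task within its job
    cpu_cores: float  # requested CPU (hypothetical unit: cores)
    ram_gib: float    # requested RAM (hypothetical unit: GiB)


@dataclass
class Job:
    name: str
    owner: str
    tasks: list = field(default_factory=list)

    @property
    def task_count(self) -> int:
        return len(self.tasks)


def make_job(name, owner, task_count, cpu_cores, ram_gib):
    """Build a job whose tasks all share the same resource requirements."""
    tasks = [Task(i, cpu_cores, ram_gib) for i in range(task_count)]
    return Job(name, owner, tasks)


job = make_job("web-frontend", "alice", task_count=3, cpu_cores=0.5, ram_gib=2.0)
```

The scheduler in this picture would place each `Task` individually, while users would refer to the `Job` as a whole.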

Scheduling: Priorities and quotas
Every job is assigned a priority (a small positive integer). Priority determines which task is preempted when there aren’t enough resources available to run both.
Prod tasks are not allowed to preempt one another, but they are allowed to preempt non-prod tasks.
Quotas are expressed as a vector of resource quantities (CPU, RAM, disk, etc.) and are used to decide which jobs are admitted for scheduling.
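The preemption rule and the quota check can be sketched in a few lines. The numeric boundary between prod and non-prod (`PROD_PRIORITY_FLOOR`) and the dictionary representation of a quota vector are assumptions made for this example.

```python
PROD_PRIORITY_FLOOR = 200  # hypothetical boundary, not a value from the slides


def is_prod(priority: int) -> bool:
    return priority >= PROD_PRIORITY_FLOOR


def can_preempt(candidate: int, victim: int) -> bool:
    """A higher-priority task may preempt a lower one, except prod vs. prod."""
    if is_prod(candidate) and is_prod(victim):
        return False
    return candidate > victim


def within_quota(request: dict, used: dict, quota: dict) -> bool:
    """Admit a job only if every requested resource fits in the remaining quota."""
    return all(used.get(r, 0) + amount <= quota.get(r, 0)
               for r, amount in request.items())
```

For example, `can_preempt(300, 100)` allows a prod task to displace a non-prod one, while `can_preempt(300, 250)` is refused because both priorities fall in the prod range.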

Scheduling: General overview
A job is submitted.
The Borgmaster records it in the Paxos store and adds its tasks to a pending queue, which the scheduler scans asynchronously.
The scheduler determines whether machines with enough available resources for a task exist (feasibility) and uses scoring to choose which one to run it on.
  Scoring looks to minimize preemptions and favors machines that already have most or all of the necessary packages installed, since installing packages is time-intensive.
The scheduler chooses the “best fit” among the feasible machines for the task.
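The two phases, feasibility and scoring, might look like the sketch below. The scoring function is a toy stand-in (prefer machines that already hold the needed packages, then the tightest fit) and is not Borg's real scoring model; preemption is left out entirely, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    free_cpu: float
    free_ram: float
    packages: frozenset = frozenset()


def feasible(machine, cpu, ram):
    """Feasibility: does the machine have enough free resources?"""
    return machine.free_cpu >= cpu and machine.free_ram >= ram


def score(machine, cpu, ram, needed_packages):
    """Toy score: more packages already installed first, then less leftover."""
    installed = len(needed_packages & machine.packages)
    leftover = (machine.free_cpu - cpu) + (machine.free_ram - ram)
    return (installed, -leftover)


def pick_machine(machines, cpu, ram, needed_packages=frozenset()):
    candidates = [m for m in machines if feasible(m, cpu, ram)]
    if not candidates:
        return None  # no feasible machine: the task stays pending
    return max(candidates, key=lambda m: score(m, cpu, ram, needed_packages))
```

Returning `None` corresponds to the task remaining in the pending queue until resources free up.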

Scheduling: The scheduler
To maintain scalability, the scheduler uses a combination of:
Score caching
  Evaluating feasibility and scoring a machine are expensive, so Borg caches the scores until the properties of the task or machine change.
Equivalence classes
  Tasks in a Borg job usually have identical requirements and constraints, so Borg does feasibility and scoring for only one task per equivalence class.
Relaxed randomization
  The scheduler examines machines in a random order until it has found enough feasible machines to score, and then selects the best from within that set.
With these three measures, scheduling an entire cell’s workload completes in a few hundred seconds. With them disabled, scheduling had not finished even after three days.
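The three techniques can be combined in one small function, sketched here under simplifying assumptions: the cache is a plain dictionary keyed by (equivalence class, machine), cache invalidation when a task or machine changes is omitted, and `sample_size`, `is_feasible`, and `score_fn` are hypothetical parameters supplied by the caller.

```python
import random


def choose_machine(task_class, machines, is_feasible, score_fn,
                   cache, sample_size=3, rng=random):
    """Pick a machine for any task belonging to task_class.

    cache maps (task_class, machine) -> score, or None when infeasible.
    Keying by task_class means identical tasks share one evaluation.
    """
    order = list(machines)
    rng.shuffle(order)  # relaxed randomization: examine machines in random order
    scored = []
    for machine in order:
        key = (task_class, machine)
        if key not in cache:  # score caching: evaluate each pair only once
            cache[key] = (score_fn(task_class, machine)
                          if is_feasible(task_class, machine) else None)
        if cache[key] is not None:
            scored.append((cache[key], machine))
        if len(scored) >= sample_size:
            break  # enough feasible machines found; stop scanning the cell
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]
```

Because the scan stops after `sample_size` feasible machines, the result is the best of a random sample rather than the best in the whole cell, which is the trade-off relaxed randomization makes for speed.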

Availability
Applications that run on Borg are expected to handle failures, so Borg:
  Automatically reschedules evicted tasks
  Spreads tasks of a job across failure domains (machines, racks, power domains)
  Limits the allowed rate of task disruptions and the number of tasks that can be down simultaneously due to maintenance activities
  Uses desired-state representations and idempotent mutating operations so that failed clients can resubmit forgotten requests without harm
  Avoids repeating task-machine pairings that cause task or machine crashes
  Recovers critical intermediate data written to a local disk by repeatedly running a logsaver task
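To see why desired-state requests are safe to resubmit, compare them with an imperative request in the sketch below. The dictionary used as cell state and both function names are hypothetical; the point is only the difference in behavior when a request is sent twice.

```python
def set_task_count(cell_state: dict, job: str, count: int) -> dict:
    """Desired-state request: 'job should have count tasks'. Safe to resend."""
    new_state = dict(cell_state)
    new_state[job] = count
    return new_state


def add_tasks(cell_state: dict, job: str, extra: int) -> dict:
    """Imperative request: 'add extra tasks'. Resending changes the result."""
    new_state = dict(cell_state)
    new_state[job] = new_state.get(job, 0) + extra
    return new_state
```

A client that crashes after sending `set_task_count` can simply send it again, whereas repeating `add_tasks` would double the number of tasks.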

Availability: Borgmaster
Even if the Borgmaster or a task’s Borglet goes down, already-running tasks continue to run. However, while the Borgmaster is down:
  New jobs cannot be submitted, and existing ones cannot be updated
  Tasks from a failed machine cannot be rescheduled
The Borgmaster achieves 99.99% availability through:
  Replication for machine failures (five copies)
  Admission control to avoid overload
  Deploying instances using simple, low-level tools to minimize external dependencies

Why use Borg?
One of the primary goals of Borg is to make efficient use of Google’s data centers full of machines. Increasing the utilization of these machines, even by a few percentage points, can translate into millions of dollars in savings.
Efficiency (and thus savings) is achieved through:
  Cell compaction
  Cell sharing
  Resource reclamation
  Isolation (security and performance)

Cell compaction
Given a workload, the authors found how small a cell it could fit into by removing randomly selected machines until the workload no longer fit.
They used the Fauxmaster to obtain simulation results, using data from actual production cells.
They determined that cell compaction provides a fair, consistent way to compare scheduling policies: better policies require fewer machines to run the same workload.
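The compaction procedure can be sketched as a loop that removes a random machine and re-checks whether the workload still fits. Here the fit check is a first-fit-decreasing packing over a single resource, which is only a stand-in for running the real scheduler in the Fauxmaster; the function names and the single-number machine capacities are assumptions.

```python
import random


def fits(tasks, machines):
    """First-fit-decreasing check: can every task be placed on some machine?"""
    free = sorted(machines, reverse=True)
    for need in sorted(tasks, reverse=True):
        for i, capacity in enumerate(free):
            if capacity >= need:
                free[i] = capacity - need
                break
        else:
            return False  # no machine could hold this task
    return True


def compact(tasks, machines, rng):
    """Remove random machines until the workload no longer fits.

    Returns the smallest cell found that still held the whole workload.
    """
    cell = list(machines)
    while cell:
        trial = list(cell)
        trial.pop(rng.randrange(len(trial)))
        if not fits(tasks, trial):
            break
        cell = trial
    return cell
```

Running `compact` with two different packing policies on the same workload and comparing the resulting cell sizes mirrors how the metric is used to compare schedulers.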

Cell sharing
Nearly all machines run both prod and non-prod tasks at the same time.
Not doing so would require approximately 20-30% more machines.

Resource reclamation
Jobs specify resource limits, which are upper bounds on the resources that each task should be granted. Borg kills a task that tries to use more RAM or disk space than it requested, and throttles CPU to the requested amount.
Prod jobs often reserve extra resources in order to handle workload spikes.
Borg reclaims the unused resources to run non-prod tasks, since these can be preempted if a prod task needs the resources back.
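A back-of-the-envelope version of reclamation: the resources a prod task requested but is not currently using are offered to non-prod work. The `safety_margin` parameter and the dictionary layout of a task are assumptions added for this sketch.

```python
def reclaimable(limit: float, usage: float, safety_margin: float = 0.0) -> float:
    """Resources a task requested but is not using, minus a safety margin."""
    return max(0.0, limit - usage - safety_margin)


def spare_for_non_prod(prod_tasks, safety_margin: float = 0.0) -> float:
    """Total reclaimable resources across the prod tasks on one machine."""
    return sum(reclaimable(t["limit"], t["usage"], safety_margin)
               for t in prod_tasks)
```

A prod task with a limit of 8 units that is using 3 leaves 5 units reclaimable; if its usage later spikes, the non-prod tasks running on those units are the ones preempted.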

Security isolation and performance isolation
Security isolation: Google uses a chroot jail as a security isolation mechanism between multiple tasks on the same machine.
Performance isolation:
  Latency-sensitive jobs are able to reserve entire physical CPU cores, which prevents other latency-sensitive jobs from using them.
  Batch jobs are able to run on any core, since they can be preempted if necessary.
  When a machine runs out of non-compressible resources (e.g., memory, disk space), the Borglet starts killing tasks from lowest to highest priority.
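The kill order for non-compressible resources can be sketched as follows. Tasks are represented as hypothetical `(name, priority, usage)` tuples and `capacity` is the machine's total for one resource such as memory; both are assumptions for the example.

```python
def tasks_to_kill(tasks, capacity):
    """Return task names to terminate, lowest priority first,
    until the remaining usage fits within capacity.

    tasks: list of (name, priority, usage) tuples.
    """
    used = sum(usage for _, _, usage in tasks)
    victims = []
    for name, _, usage in sorted(tasks, key=lambda t: t[1]):
        if used <= capacity:
            break  # the remaining tasks now fit
        victims.append(name)
        used -= usage
    return victims
```

High-priority tasks are reached only after every lower-priority task has already been terminated.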

Lessons learned
Bad:
  Jobs are restrictive as the only grouping mechanism for tasks
  One IP address per machine complicates things
  Optimizing for power users at the expense of casual ones
Good:
  Allocs are useful
    An alloc is a reserved set of resources on a machine, which remain reserved whether or not they are used
  Cluster management is more than task management
  Introspection is vital
  The master is the kernel of a distributed system
