Large-scale cluster management at Google with Borg

Large-scale cluster management at Google with Borg
Paper by: A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E. Tune, J. Wilkes
Presented by: Katie Wcisel, CS345, Fall 2016

What is Borg and who are its users?
Borg is a highly scalable, reliable cluster management system at Google that admits, schedules, starts, restarts, and monitors the full range of applications that Google runs.
Its users are the Google developers and system administrators who run Google's applications and services.

Architecture of Borg: Overview
- A single Borg cell consists of the machines in a single cluster.
- Each cell is managed independently of the others.
- The Borgmaster is the centralized controller of a cell.
- Each machine runs its own Borglet, which communicates with the Borgmaster.
- There is also a Fauxmaster, a Borgmaster simulator (covered below).

Architecture of Borg: The Borgmaster
Consists of two processes:
- The main Borgmaster process, which handles client remote procedure calls (RPCs) that either mutate state (e.g., create a job) or provide read-only access (e.g., look up a job), and which manages the state machines for tasks, allocs, etc.
- A separate scheduler process
Each Borg cell runs five replicas of the Borgmaster, but only one is the elected master at any given time.
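As a rough illustration of this split (all names hypothetical, and Python is used for the sketches throughout), a mutating RPC would be recorded in the replicated Paxos store before updating in-memory state, while a read-only RPC answers directly from the in-memory copy:

```python
class Borgmaster:
    def __init__(self):
        self.paxos_log = []   # stand-in for the replicated Paxos store
        self.jobs = {}        # in-memory cell state

    def create_job(self, name, config):   # mutating RPC: log first, then apply
        self.paxos_log.append(("create_job", name, config))
        self.jobs[name] = config
        return "ok"

    def lookup_job(self, name):           # read-only RPC: in-memory state only
        return self.jobs.get(name)
```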

Architecture of Borg: The Borglets
Each machine runs a Borglet, a local Borg agent that is responsible for:
- starting, stopping, and restarting tasks
- managing local resources
- reporting the current state of the machine to the Borgmaster
If a Borglet becomes unreachable, the Borgmaster reschedules its assigned tasks, as sketched below.
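A minimal sketch of that poll-and-reschedule behavior, assuming a hypothetical polling RPC and an invented threshold for declaring a machine down:

```python
MAX_MISSED_POLLS = 3          # invented threshold, not a Borg constant

class Machine:
    def __init__(self, name, tasks):
        self.name = name
        self.tasks = tasks
        self.missed_polls = 0

def poll(machine):
    """Placeholder for the real RPC that fetches a Borglet's state."""
    return True

def poll_round(machines, pending_queue):
    for m in machines:
        if poll(m):
            m.missed_polls = 0
        else:
            m.missed_polls += 1
            if m.missed_polls >= MAX_MISSED_POLLS:
                pending_queue.extend(m.tasks)  # hand tasks back to the scheduler
                m.tasks = []
```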

Architecture of Borg: The Fauxmaster
A high-fidelity Borgmaster simulator, used for:
- debugging
- capacity planning
- sanity checks before making changes to a cell's configuration

Borg is used for…
- Long-running services that should never go down: Gmail, Google Docs, web search, and internal infrastructure services such as Bigtable
- Batch jobs that take anywhere from a few seconds to a few days to complete; these are typically less sensitive to short-term performance fluctuations
Within Borg, higher-priority jobs are referred to as prod, and everything else as non-prod.

Jobs vs. tasks
- Work is submitted to Borg in the form of jobs.
- A job consists of one or more tasks, all running in a single Borg cell.
- Job properties: name, owner, number of tasks
- Tasks are the smallest separable parts of a job; scheduling is done on tasks rather than on whole jobs.
- Task properties: resource requirements and the task's index within its job
- Users operate on jobs by issuing RPCs to Borg. They can change the properties of some or all tasks in a job by pushing a new job configuration, but some changes require tasks to be restarted or reassigned to a new machine.
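A toy data model matching the properties listed above; the field names are assumptions for illustration, not Borg's actual configuration schema:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    index: int            # the task's index within its job
    cpu_millicores: int   # resource requirements
    ram_mb: int

@dataclass
class Job:
    name: str
    owner: str
    cell: str             # a job's tasks all run in a single cell
    tasks: list = field(default_factory=list)

job = Job(name="websearch", owner="alice", cell="cell-a")
job.tasks = [Task(index=i, cpu_millicores=500, ram_mb=1024) for i in range(3)]
print(len(job.tasks))     # job property: number of tasks
```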

Scheduling: Priorities and quotas
- Every job is assigned a priority (a small positive integer), which determines which task gets preempted when there aren't enough available resources to run both (see the sketch below).
- Prod tasks are not allowed to preempt one another, but may preempt non-prod tasks.
- Quota is expressed as a vector of resource quantities (CPU, RAM, disk, etc.) and is used to decide which jobs to admit for scheduling.
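The preemption rule fits in a few lines; the numeric prod cutoff is invented for illustration:

```python
PROD_THRESHOLD = 100   # invented cutoff; prod priorities sit above it

def is_prod(priority: int) -> bool:
    return priority >= PROD_THRESHOLD

def may_preempt(pending: int, running: int) -> bool:
    """Can a pending task at priority `pending` evict a running task?"""
    if is_prod(pending) and is_prod(running):
        return False            # prod tasks never preempt one another
    return pending > running    # otherwise, higher priority wins

assert may_preempt(pending=120, running=50)        # prod evicts non-prod
assert not may_preempt(pending=120, running=110)   # prod vs. prod: never
```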

Scheduling: General overview
1. A job is submitted. The Borgmaster records it in the Paxos store and adds its tasks to a pending queue.
2. The pending queue is scanned asynchronously by the scheduler.
3. The scheduler determines whether any machine has enough available resources for a task (feasibility), then uses scoring to choose which feasible machine to run it on. Scoring tries to minimize preemptions and favors machines that already have most or all of the necessary packages installed (installing them is time-intensive).
4. The scheduler picks the "best fit" among all feasible machines for the task.
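A stripped-down feasibility-then-scoring pass; the score here is a stand-in, since Borg's real scoring also weighs preemptions, package locality, and other factors:

```python
from collections import namedtuple

Machine = namedtuple("Machine", "name free_cpu free_ram")
Task = namedtuple("Task", "name cpu ram")

def feasible(machine, task):
    return machine.free_cpu >= task.cpu and machine.free_ram >= task.ram

def best_machine(task, machines):
    def score(m):
        # Stand-in "best fit" score: prefer the tightest packing,
        # i.e. the machine with the least leftover CPU after placement.
        return -(m.free_cpu - task.cpu)
    candidates = [m for m in machines if feasible(m, task)]
    return max(candidates, key=score) if candidates else None
```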

Scheduling: The scheduler
To remain scalable, the scheduler uses a combination of:
- Score caching: evaluating feasibility and scoring a machine are expensive, so Borg caches scores until the properties of the task or machine change.
- Equivalence classes: tasks in a Borg job usually have identical requirements and constraints, so Borg performs feasibility checking and scoring only once per equivalence class.
- Relaxed randomization: the scheduler examines machines in random order until it has found enough feasible machines to score, then selects the best from within that set (sketched below).
Together, these measures allow an entire cell's workload to be scheduled in a few hundred seconds; without them, scheduling had still not completed after three days.
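Relaxed randomization in miniature; the sample-size constant is an assumption, and `feasible` and `score` are passed in to keep the sketch self-contained:

```python
import random

ENOUGH = 10   # hypothetical sample size; the paper gives no constant

def pick_with_relaxed_randomization(task, machines, feasible, score):
    shuffled = random.sample(machines, len(machines))  # random visit order
    candidates = []
    for m in shuffled:
        if feasible(m, task):
            candidates.append(m)
            if len(candidates) >= ENOUGH:
                break             # stop early instead of scanning the whole cell
    if not candidates:
        return None
    return max(candidates, key=lambda m: score(m, task))
```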

Availability
Applications that run on Borg are expected to handle failures, so Borg:
- automatically reschedules evicted tasks
- spreads the tasks of a job across failure domains (machines, racks, power domains)
- limits the allowed rate of task disruptions and the number of tasks that can be down simultaneously due to maintenance activities
- uses desired-state representations and idempotent mutating operations, so that failed clients can resubmit forgotten requests without harm (sketched below)
- avoids repeating task-machine pairings that cause task or machine crashes
- recovers critical intermediate data written to a local disk by repeatedly running a logsaver task
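A toy example of the idempotent-mutation idea: the client attaches a unique operation id, so a request resent after a lost reply returns the cached result instead of being applied twice. All names are illustrative:

```python
class CellState:
    """Toy desired-state store keyed by job name."""
    def __init__(self):
        self.jobs = {}
        self.applied = {}   # op_id -> cached reply

    def submit(self, op_id, job_name, config):
        if op_id in self.applied:        # retried request: replay the reply
            return self.applied[op_id]
        self.jobs[job_name] = config     # apply the desired state
        reply = f"job {job_name} accepted"
        self.applied[op_id] = reply
        return reply

state = CellState()
r1 = state.submit("op-123", "websearch", {"tasks": 3})
r2 = state.submit("op-123", "websearch", {"tasks": 3})  # reply lost, resent
assert r1 == r2              # the duplicate submission is harmless
```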

Availability: Borgmaster
Even if the Borgmaster or a task's Borglet goes down, already-running tasks continue to run. But while the Borgmaster is down:
- new jobs cannot be submitted, and existing ones cannot be updated
- tasks from failed machines cannot be rescheduled
The Borgmaster achieves 99.99% availability through:
- replication to tolerate machine failures (five copies)
- admission control to avoid overload
- deploying instances using simple, low-level tools, to minimize external dependencies

Why use Borg?
One of Borg's primary goals is to make efficient use of Google's data centers full of machines: increasing the utilization of these machines by even a few percentage points can translate into millions of dollars in savings. Efficiency (and thus savings) is achieved through:
- cell compaction
- cell sharing
- resource reclamation
- isolation (security and performance)

Cell compaction
- Given a workload, the authors measured how small a cell it could be fitted into by removing randomly selected machines until the workload no longer fit.
- They used the Fauxmaster to obtain simulation results, driven by data from actual production cells.
- They found that cell compaction provides a fair, consistent way to compare scheduling policies: better policies require fewer machines to run the same workload.
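The evaluation loop, in outline; here `fits(workload, machines)` stands in for a full Fauxmaster simulation of the scheduler on the reduced cell:

```python
import random

def compacted_cell_size(machines, workload, fits):
    """Remove randomly chosen machines until the workload no longer fits."""
    machines = list(machines)
    while len(machines) > 1:
        survivors = list(machines)
        survivors.remove(random.choice(survivors))
        if not fits(workload, survivors):
            break                # that machine was needed; stop compacting
        machines = survivors
    return len(machines)
```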

Cell sharing
- Nearly all machines run both prod and non-prod tasks at the same time.
- Segregating the two would require approximately 20-30% more machines.

Resource reclamation
- Jobs specify resource limits: upper bounds on the resources a task should be granted. Borg will kill a task that tries to use more resources than it requested.
- Prod jobs often reserve extra resources in order to handle workload spikes.
- Borg reclaims the unused portion of these reservations to run non-prod tasks, which can be preempted if the resources are later needed by a prod task.
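A sketch of a reclamation estimate along these lines: reclaimable resources are the gap between a task's reservation and its observed usage plus a safety cushion. The margin value and field names are invented for illustration:

```python
SAFETY_MARGIN = 1.2   # hypothetical cushion above observed usage

def reclaimable_cpu(tasks):
    """tasks: dicts with 'limit_cpu' (reservation) and 'used_cpu' (observed).
    Returns the CPU that could be lent to non-prod work."""
    total = 0.0
    for t in tasks:
        estimate = t["used_cpu"] * SAFETY_MARGIN
        total += max(0.0, t["limit_cpu"] - estimate)
    return total
```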

Security isolation and performance isolation
- Security: Google uses a chroot jail as the primary security isolation mechanism between multiple tasks on the same machine.
- Performance:
  - Latency-sensitive jobs can reserve entire physical CPU cores, which prevents other latency-sensitive jobs from using them.
  - Batch jobs can run on any core, since they can be preempted if necessary.
  - When a machine runs out of non-compressible resources (e.g., memory, disk space), it starts killing tasks from lowest to highest priority (sketched below).
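The lowest-priority-first eviction from the last bullet, in sketch form; the data layout is an assumption:

```python
def tasks_to_kill(tasks, ram_capacity_mb):
    """tasks: dicts with 'priority' and 'ram_mb'. Returns the victims,
    lowest priority first, until remaining usage fits the machine."""
    used = sum(t["ram_mb"] for t in tasks)
    victims = []
    for t in sorted(tasks, key=lambda t: t["priority"]):
        if used <= ram_capacity_mb:
            break                 # pressure relieved; stop killing
        victims.append(t)
        used -= t["ram_mb"]
    return victims
```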

Lessons learned
The bad:
- Jobs are too restrictive as the only grouping mechanism for tasks.
- One IP address per machine complicates things.
- Optimizing for power users came at the expense of casual ones.
The good:
- Allocs are useful: a reserved set of resources on a machine that remains reserved whether or not it is used.
- Cluster management is more than task management.
- Introspection is vital.
- The master is the kernel of a distributed system.