Large-scale cluster management at Google with Borg By: Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, John Wilkes Presented.

Large-scale cluster management at Google with Borg By: Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, John Wilkes Presented by: Xianghan Pei EECS 582 – W161

Outline Introduction & Motivation The User Perspective Borg Architecture Scalability & Availability Utilization Isolation Lessons Future and Related Work EECS 582 – W162

Introduction & Motivation Introduction Borg: An OS of Cluster(Datacenter) Motivation Hide the details(programmer focus on App) Provide resource sharing Provide high reliability and availability for Cluster EECS 582 – W163 The high-level architecture of Borg

The User Perspective Workload production(prod) VS non-production(non-prod) Cells Heterogeneous, around 10 k machines in median Jobs and Tasks EECS 582 – W164 > borgcfg.../hello_world_webserver.borg up

The User Perspective Allocs Reserved set of resources Priority, Quota, and Admission Control Job has a priority (preempting) Quota is used to decide which jobs to admit for scheduling Naming and Monitoring 50.jfoo.ubar.cc.borg.google.com Monitoring health of the task and thousands of performance metrics EECS 582 – W165

Borg Architecture Borgmaster Main Borgmaster process & Scheduler Five replicas Borglet Manage and monitor tasks and resource Borgmaster polls Borglet every few seconds EECS 582 – W166 The high-level architecture of Borg(A Cell)

Borg Architecture Scheduling feasibility checking: find machines Scoring: pick one machines User preferences & build-in criteria E-PVM VS best-fit Tradeoff EECS 582 – W167 The high-level architecture of Borg(A Cell)

Scalability Separate scheduler Separate threads to poll the Borglets Partition functions across the five replicas Score caching Equivalence classes Relaxed randomization EECS 582 – W168 The high-level architecture of Borg(A Cell)

Availability System Borgmaster: Five replicas Borglet: Borgmaster Take checkpoint Job Reschedules evicted tasks Spreading tasks across failure domian … EECS 582 – W169

Utilization Cell Sharing Segregating prod and non-prod work into different cells would need more machines EECS 582 – W1610

Utilization Cell Size Subdividing cells into smaller ones would require more machines EECS 582 – W1611

Utilization Fine-grained resource requests EECS 582 – W1612 No bucket sizes fit most of the tasks wellBucketing resource would need more machines

Utilization Resource reclamation estimate how many resources a task will use and reclaim the rest for work Kill non-prod if not available EECS 582 – W1613

Utilization Resource reclamation Choose medium EECS 582 – W1614

Isolation Security isolation chroot jail as the primary security isolation mechanism Performance isolation High-priority LV task(prod) get best treatment compressible resources: reclaimed non-compressible resources: kill or remove task EECS 582 – W1615

Lessons Bad Jobs are restrictive as the only grouping mechanism for tasks One IP address per machine complicates things Optimizing for power users at the expense of casual ones Good Allocs are useful Cluster management is more than task management Introspection is vital The master is the kernel of a distributed system EECS 582 – W1616

Future Work Kubernetes Evolved from Borg An open-source system for automating deployment, operations, and scaling of containerized applications Pods: groups of containers Labels Replica controller Services EECS 582 – W1617

Relative Work Apache Mesos YARN Tupperware EECS 582 – W1618

Questions? EECS 582 – W1619

References Borg Paper Borg Slides at InfoQBorg Slides EECS 582 – W1620

Large-scale cluster management at Google with Borg By: Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, John Wilkes Presented.

Similar presentations

Presentation on theme: "Large-scale cluster management at Google with Borg By: Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, John Wilkes Presented."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Large-scale cluster management at Google with Borg By: Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, John Wilkes Presented.

Similar presentations

Presentation on theme: "Large-scale cluster management at Google with Borg By: Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, John Wilkes Presented."— Presentation transcript:

Similar presentations

About project

Feedback