Apache Mesos Design Decisions
Benjamin Hindman
this is not a talk about YARN
at least not explicitly!
this talk is about Mesos!
a little history: Mesos started as a research project at Berkeley in early 2009, by Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica
our motivation: increase the performance and utilization of clusters
our intuition ①static partitioning considered harmful
static partitioning considered harmful: sharing the datacenter dynamically is faster and achieves higher utilization!
our intuition ②build new frameworks
“Map/Reduce is a big hammer, but not everything is a nail!”
Apache Mesos is a distributed system for running and building other distributed systems
Mesos is a cluster manager
Mesos is a resource manager
Mesos is a resource negotiator
Mesos replaces static partitioning of resources to frameworks with dynamic resource allocation
Mesos is a distributed system with a master/slave architecture
frameworks register with the Mesos master in order to run jobs/tasks
frameworks can be required to authenticate as a principal: masters are initialized with secrets and use SASL with the CRAM-MD5 mechanism (Kerberos support in development)
in early 2010, a new goal: run long-running services elastically on Mesos
Apache Aurora (incubating) is a Mesos framework that makes it easy to launch services written in Ruby, Java, Scala, Python, Go, etc.!
Storm, Jenkins, …
a lot of interesting design decisions along the way
many appear (IMHO) in YARN too
design decisions ①two-level scheduling and resource offers ②fair-sharing and revocable resources ③high-availability and fault-tolerance ④execution and isolation ⑤C++
frameworks get allocated resources from the masters. resources are allocated via resource offers: a resource offer represents a snapshot of available resources on one host that a framework can use to run tasks, e.g., offer { hostname, 4 CPUs, 4 GB RAM }
frameworks use these resources to decide what tasks to run; a task can use a subset of an offer, e.g., task { 3 CPUs, 2 GB RAM }
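A minimal sketch of this offer/task model (illustrative types only, not the actual Mesos API):

```cpp
// Illustrative model of resource offers: hypothetical names, not Mesos code.
#include <iostream>
#include <string>
#include <vector>

struct Offer {       // snapshot of available resources on one host
  std::string hostname;
  double cpus;
  double memGB;
};

struct Task {        // a task may claim any subset of an offer
  std::string name;
  double cpus;
  double memGB;
};

// A framework scheduler decides what to run with the offered resources.
std::vector<Task> schedule(const Offer& offer) {
  std::vector<Task> tasks;
  if (offer.cpus >= 3 && offer.memGB >= 2) {
    tasks.push_back({"my-task", 3, 2});   // uses a subset of the offer
  }
  return tasks;                           // unused resources return to the master
}

int main() {
  Offer offer{"host1.example.com", 4, 4};  // "here are 4 CPUs, 4 GB RAM"
  for (const Task& t : schedule(offer)) {
    std::cout << "launch " << t.name << " on " << offer.hostname << "\n";
  }
}
```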
Mesos challenged the status quo of cluster managers
cluster manager status quo: the application hands the cluster manager a specification that includes as much information as possible to assist in scheduling and execution, waits for its tasks to be executed, and eventually gets back a result
problems with specifications: ① hard to express certain desires or constraints ② hard to update specifications dynamically as tasks execute and finish/fail
an alternative model: the framework sends the masters a request, e.g., request { 3 CPUs, 2 GB RAM }; a request is a purposely simplified subset of a specification, mainly just the required resources
question: what should Mesos do if it can’t satisfy a request?
① wait until it can …
② offer the best it can immediately
an alternative model: instead, the masters send the framework resource offers, e.g., offer { hostname, 4 CPUs, 4 GB RAM }; the framework uses the offers to perform its own scheduling
an analogue: non-blocking sockets. the application asks the kernel to write(s, buffer, size); … and the kernel replies “42 of 100 bytes written!”
resource offers address asynchrony in resource allocation
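To make the analogue concrete, here is a minimal, self-contained C++/POSIX sketch (using a pipe instead of a socket purely for brevity): a non-blocking write returns however much the kernel can accept right now rather than blocking until the whole request fits.

```cpp
// The non-blocking analogue: ask for everything, accept a partial result.
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main() {
  int fds[2];
  if (pipe(fds) != 0) return 1;
  fcntl(fds[1], F_SETFL, O_NONBLOCK);   // make the write end non-blocking

  std::vector<char> buffer(100000);     // ask to write more than the pipe holds
  ssize_t n = write(fds[1], buffer.data(), buffer.size());

  // Like a resource offer, the kernel immediately gives what it can right now
  // (typically "65536 of 100000 bytes written" on Linux) instead of blocking
  // until the full request can be satisfied.
  printf("%zd of %zu bytes written\n", n, buffer.size());

  close(fds[0]);
  close(fds[1]);
}
```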
IIUC, even YARN allocates “the best it can” to an application when it can’t satisfy a request
requests are complementary (but not necessary)
offers represent the currently available resources a framework can use
question: should resources within offers be disjoint?
the masters can send framework1 and framework2 the same offer { hostname, 4 CPUs, 4 GB RAM }
concurrency control: optimistic vs. pessimistic
optimistic: all offers overlap with one another, causing frameworks to “compete” first-come-first-served
pessimistic: offers made to different frameworks are disjoint
Mesos semantics: assume overlapping offers
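A toy sketch of the optimistic end of that spectrum: every framework sees an overlapping offer for the same host, and the master validates launches first-come-first-served (hypothetical names, not Mesos code):

```cpp
// Optimistic concurrency control: overlapping offers, validated at launch.
#include <iostream>
#include <map>
#include <string>

std::map<std::string, double> available = {{"host1", 4.0}};  // CPUs per host

// Every framework "sees" the full availability (overlapping offers)...
bool launch(const std::string& framework, const std::string& host, double cpus) {
  if (available[host] >= cpus) {       // ...but the master validates at launch
    available[host] -= cpus;
    std::cout << framework << ": launched (" << cpus << " cpus)\n";
    return true;
  }
  std::cout << framework << ": launch failed, resources already taken\n";
  return false;                        // the loser must wait for new offers
}

int main() {
  launch("framework1", "host1", 3);    // wins the race
  launch("framework2", "host1", 3);    // competed for the same offer; loses
}
```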
design comparison: Google’s Omega
the Omega model: a framework gets a snapshot of the cluster state from a database (note: it does not make a request!)
the Omega model: a framework submits a transaction to the database to “acquire” resources (which it can then use to run tasks); a transaction fails when another framework has already acquired the sought resources
isomorphism?
observation: snapshots are optimistic offers
Omega and Mesos: Omega's snapshot from the database corresponds to a Mesos offer { hostname, 4 CPUs, 4 GB RAM } from the masters
Omega and Mesos: Omega's transaction against the database corresponds to a Mesos framework launching a task { 3 CPUs, 2 GB RAM }
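A sketch of that correspondence: an Omega-style transaction is essentially a compare-and-swap against shared cluster state, reduced here to a single atomic counter of free CPUs (an illustration, not either system's actual code):

```cpp
// The Omega model in miniature: snapshot = optimistic view of shared state,
// acquiring resources = a compare-and-swap transaction against it.
#include <atomic>
#include <iostream>

std::atomic<int> freeCpus{4};  // shared cluster state ("the database")

bool transact(int want) {
  int snapshot = freeCpus.load();              // framework reads a snapshot
  while (snapshot >= want) {
    // Commit succeeds only if no other framework changed the state meanwhile;
    // on failure, 'snapshot' is refreshed and we re-check.
    if (freeCpus.compare_exchange_weak(snapshot, snapshot - want)) return true;
  }
  return false;                                // failed transaction: retry later
}

int main() {
  std::cout << "framework1: " << (transact(3) ? "acquired" : "failed") << "\n";
  std::cout << "framework2: " << (transact(3) ? "acquired" : "failed") << "\n";
}
```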
thought experiment: what's gained by exploiting the continuous spectrum from pessimistic to optimistic?
design decisions ①two-level scheduling and resource offers ②fair-sharing and revocable resources ③high-availability and fault-tolerance ④execution and isolation ⑤C++
Mesos allocates resources to frameworks using a fair-sharing algorithm we created called Dominant Resource Fairness (DRF)
DRF, born of static partitioning
static partitioning across teams (promotions, trends, recommendations) is trivially fairly shared!
goal: fairly share the resources without static partitioning
partition utilizations:
team            CPU utilization   RAM utilization   bottleneck
promotions      45%               100%              RAM
trends          75%               100%              RAM
recommendations 100%              50%               CPU
observation: a dominant resource bottlenecks each team from running any more jobs/tasks
insight: allocating a fair share of each team’s dominant resource guarantees they can run at least as many jobs/tasks as with static partitioning!
… if my team gets at least 1/N of my dominant resource I will do no worse than if I had my own cluster, but I might do better when resources are available!
DRF in Mesos: ① frameworks specify a role when they register (i.e., the team to charge for the resources) ② the master calculates each role's dominant resource (dynamically) and allocates appropriately
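A minimal sketch of the DRF idea (illustrative, unsaturated numbers; the real allocator lives in the Mesos master): each role's dominant share is its largest fraction of any single resource, and the next offer goes to the role with the smallest dominant share.

```cpp
// Dominant Resource Fairness in miniature: offer resources to the role with
// the smallest dominant share, optionally divided by the role's weight.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct Role {
  std::string name;
  double weight;          // weighted DRF: higher weight => larger fair share
  double cpus, memGB;     // resources currently allocated to this role
};

const double TOTAL_CPUS = 100, TOTAL_MEM = 100;

double dominantShare(const Role& r) {
  double share = std::max(r.cpus / TOTAL_CPUS, r.memGB / TOTAL_MEM);
  return share / r.weight;
}

int main() {
  std::vector<Role> roles = {
    {"promotions",      1.0, 45, 80},   // dominant resource: RAM (0.80)
    {"trends",          1.0, 75, 90},   // dominant resource: RAM (0.90)
    {"recommendations", 1.0, 90, 50},   // dominant resource: CPU (0.90)
  };

  // The next offer goes to whichever role is furthest below its fair share.
  const Role* next = &roles[0];
  for (const Role& r : roles) {
    if (dominantShare(r) < dominantShare(*next)) next = &r;
  }
  std::cout << "offer next resources to: " << next->name << "\n";
}
```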
in practice, fair sharing is insufficient
weighted fair sharing: each team gets a weight
Mesos implements weighted DRF: masters can be configured with weights per role; resource allocation decisions incorporate the weights to determine dominant fair shares
in practice, weighted fair sharing is still insufficient
a non-cooperative framework (e.g., one with long tasks, or a buggy one) can get allocated too many resources
Mesos provides reservations: slaves can be configured with resource reservations for particular roles (dynamic, time-based, and percentage-based reservations are in development); resource offers include the reservation role (if any), e.g., offer { hostname, 4 CPUs, 4 GB RAM, role: trends }
reservations provide guarantees, but at the cost of utilization
revocable resources: reserved resources that are unused can be allocated to frameworks from different roles (e.g., an offer { hostname, 4 CPUs, 4 GB RAM, role: trends } made to a promotions framework), but those resources may be revoked at any time
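A tiny sketch of how an allocator might mark lent-out reserved capacity as revocable (hypothetical types, not the Mesos API):

```cpp
// Reservations with revocable resources: unused reserved capacity is lent to
// other roles, but flagged revocable so it can be reclaimed at any time.
#include <iostream>
#include <string>

struct Offer {
  std::string role;      // role holding the reservation (e.g., "trends")
  double cpus;
  bool revocable;        // true when lent to a framework outside the role
};

Offer makeOffer(const std::string& reservedFor, const std::string& requester,
                double unusedCpus) {
  // Same resources, different guarantees: the reserving role gets them
  // outright; anyone else gets them marked revocable (tasks may be killed).
  return Offer{reservedFor, unusedCpus, requester != reservedFor};
}

int main() {
  Offer a = makeOffer("trends", "trends", 4);
  Offer b = makeOffer("trends", "promotions", 4);
  std::cout << "offer to trends revocable? " << a.revocable << "\n";      // 0
  std::cout << "offer to promotions revocable? " << b.revocable << "\n";  // 1
}
```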
preemption via revocation: “my tasks will not be killed unless I'm using revocable resources!”
design decisions ①two-level scheduling and resource offers ②fair-sharing and revocable resources ③high-availability and fault-tolerance ④execution and isolation ⑤C++
high-availability and fault-tolerance: ① framework failover ② master failover ③ slave failover (machine failure, process failure (bugs!), upgrades)
① framework failover: the framework re-registers with the master and resumes operation; all tasks keep running across framework failover!
② master failover: after a new master is elected, all frameworks and slaves connect to the new master; all tasks keep running across master failover!
③ slave failover: the mesos-slave process can fail or be restarted (e.g., for an upgrade) without killing its executors and tasks; the restarted mesos-slave recovers from checkpointed state and reconnects to the still-running tasks (important for large in-memory services, which are expensive to restart)
design decisions ①two-level scheduling and resource offers ②fair-sharing and revocable resources ③high-availability and fault-tolerance ④execution and isolation ⑤C++
execution: frameworks launch fine-grained tasks (e.g., task { 3 CPUs, 2 GB RAM }); if necessary, a framework can provide an executor to handle the execution of its tasks
execution on the slave: the mesos-slave launches the framework's executor, and the executor runs the task
goal: isolation
isolation: on each slave, the mesos-slave runs executors and their tasks inside containers
executor + task design means containers can have changing resource allocations
as tasks come and go, each container's resource allocation is grown and shrunk accordingly
making the task first-class gives us true fine-grained resource sharing
requirement: fast task launching (i.e., milliseconds or less)
virtual machines: an anti-pattern
operating-system virtualization: containers (zones and projects), control groups (cgroups), namespaces
isolation: tight integration with cgroups: CPU (upper and lower bounds), memory, network I/O (traffic controller, in development), filesystem (using LVM, in development)
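For flavor, here is roughly what cgroups integration looks like at the filesystem level: control files are written like ordinary files (cgroup v1 layout; the mount point, container name, and pid are assumptions, and the cgroup directories must already have been created):

```cpp
// Sketch of cgroup-based isolation (cgroup v1, as in Mesos of this era).
#include <fstream>
#include <string>

void writeControl(const std::string& path, const std::string& value) {
  std::ofstream file(path);   // cgroup control files are written like files
  file << value;
}

int main() {
  const std::string cg = "mesos_executor_1";  // hypothetical container name

  // CPU lower bound via shares; upper bound via CFS quota
  // (400000us per default 100000us period = at most 4 CPUs).
  writeControl("/sys/fs/cgroup/cpu/" + cg + "/cpu.shares", "1024");
  writeControl("/sys/fs/cgroup/cpu/" + cg + "/cpu.cfs_quota_us", "400000");

  // Memory upper bound: 4 GB.
  writeControl("/sys/fs/cgroup/memory/" + cg + "/memory.limit_in_bytes",
               "4294967296");

  // Attach a process (pid 1234, hypothetical) to the container.
  writeControl("/sys/fs/cgroup/cpu/" + cg + "/tasks", "1234");
}
```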
statistics: rarely does allocation == usage (humans are bad at estimating the amount of resources they're using); statistics are collected for capacity planning (and, in development, oversubscription)
CPU upper bounds? in practice, determinism trumps utilization
design decisions ①two-level scheduling and resource offers ②fair-sharing and revocable resources ③high-availability and fault-tolerance ④execution and isolation ⑤C++
requirements: ①performance ②maintainability (static typing) ③interfaces to low-level OS (for isolation, etc) ④interoperability with other languages (for library bindings)
garbage collection: a performance anti-pattern
consequences: ①antiquated libraries (especially around concurrency and networking) ②nascent community
github.com/3rdparty/libprocess: concurrency via futures/actors, networking via message passing
github.com/3rdparty/stout: monads in C++, safe and understandable utilities
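For flavor, the futures style libprocess enables, illustrated here with the C++ standard library (this is not the libprocess API; see the repo above for the real one):

```cpp
// Futures-based concurrency, sketched with std::async/std::future.
#include <future>
#include <iostream>

int computeAnswer() { return 42; }  // stand-in for an asynchronous operation

int main() {
  // Kick off work asynchronously and get a future for the eventual result.
  std::future<int> answer = std::async(std::launch::async, computeAnswer);

  // The caller stays unblocked until it actually needs the value; libprocess
  // goes further, with continuations and actor-style message passing.
  std::cout << "answer = " << answer.get() << "\n";
}
```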
but …
scalability: simulations to 50,000+ slaves
@twitter we run multiple Mesos clusters, each with … nodes
design decisions ①two-level scheduling and resource offers ②fair-sharing and revocable resources ③high-availability and fault-tolerance ④execution and isolation ⑤C++
final remarks
frameworks Hadoop (github.com/mesos/hadoop) Spark (github.com/mesos/spark) DPark (github.com/douban/dpark) Storm (github.com/nathanmarz/storm) Chronos (github.com/airbnb/chronos) MPICH2 (in mesos git repository) Marathon (github.com/mesosphere/marathon) Aurora (github.com/twitter/aurora)
write your next distributed system with Mesos!
port a framework to Mesos: write a “wrapper”; ~100 lines of code (the more lines, the more you can take advantage of elasticity and other Mesos features); see github.com/mesos/hadoop, and the sketch below
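A condensed sketch of what such a wrapper's scheduler looks like against the classic Mesos C++ API; the callback signatures here are reconstructed from memory, so verify them against mesos/scheduler.hpp for your release before relying on them.

```cpp
// Minimal Mesos framework "wrapper": accept offers, launch one command task.
#include <mesos/scheduler.hpp>
#include <iostream>
#include <string>
#include <vector>

using namespace mesos;

class WrapperScheduler : public Scheduler {
public:
  // The heart of a wrapper: turn resource offers into tasks.
  void resourceOffers(SchedulerDriver* driver,
                      const std::vector<Offer>& offers) override {
    for (const Offer& offer : offers) {
      TaskInfo task;
      task.set_name("wrapped-task");
      task.mutable_task_id()->set_value("task-1");
      task.mutable_slave_id()->MergeFrom(offer.slave_id());
      task.mutable_command()->set_value("echo hello from mesos");
      task.mutable_resources()->MergeFrom(offer.resources());

      driver->launchTasks(offer.id(), {task});
    }
  }

  void statusUpdate(SchedulerDriver*, const TaskStatus& status) override {
    std::cout << "task " << status.task_id().value()
              << " is now in state " << status.state() << std::endl;
  }

  // Remaining callbacks stubbed out for brevity.
  void registered(SchedulerDriver*, const FrameworkID&, const MasterInfo&) override {}
  void reregistered(SchedulerDriver*, const MasterInfo&) override {}
  void disconnected(SchedulerDriver*) override {}
  void offerRescinded(SchedulerDriver*, const OfferID&) override {}
  void frameworkMessage(SchedulerDriver*, const ExecutorID&, const SlaveID&,
                        const std::string&) override {}
  void slaveLost(SchedulerDriver*, const SlaveID&) override {}
  void executorLost(SchedulerDriver*, const ExecutorID&, const SlaveID&, int) override {}
  void error(SchedulerDriver*, const std::string&) override {}
};

int main() {
  FrameworkInfo framework;
  framework.set_user("");            // let Mesos fill in the current user
  framework.set_name("my-wrapper");  // hypothetical framework name

  WrapperScheduler scheduler;
  MesosSchedulerDriver driver(&scheduler, framework, "master.host:5050");
  return driver.run() == DRIVER_STOPPED ? 0 : 1;
}
```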
Thank You! mesos.apache.org
stateless master: to make master failover fast, we chose to make the master stateless; state is stored in the leaves, at the frameworks and the slaves. this makes sense for frameworks that don't want to store state (i.e., can't actually failover). consequences: slaves are fairly complicated (they need to checkpoint), and frameworks need to save their own state and reconcile (we built some tools to help, including a replicated log)
Apache Mesos is a distributed system for running and building other distributed systems
origins: a Berkeley research project including Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica (see mesos.apache.org/documentation)
ecosystem: mesos developers, operators, framework developers
a tour of mesos from different perspectives of the ecosystem
the operator
people who run and manage frameworks (Hadoop, Storm, MPI, Spark, Memcache, etc.); tools: virtual machines, Chef, Puppet (emerging: PaaS, Docker); “ops” at most companies (SREs at Twitter); the static partitioners
for the operator, Mesos is a cluster manager
for the operator, Mesos is a resource manager
for the operator, Mesos is a resource negotiator
for the operator, Mesos replaces static partitioning of resources to frameworks with dynamic resource allocation
for the operator, Mesos is a distributed system with a master/slave architecture
frameworks/applications register with the Mesos master in order to run jobs/tasks
frameworks can be required to authenticate as a principal*: masters are initialized with secrets and use SASL with the CRAM-MD5 mechanism (Kerberos in development)
Mesos is highly-available and fault-tolerant
the framework developer
…
Mesos uses Apache ZooKeeper for coordination
increase utilization with revocable resources and preemption: an unreserved offer { hostname, 4 CPUs, 4 GB RAM, role: - } can be made to framework1, framework2, and framework3
authorization*: principals can be used for authorizing allocation roles and authorizing operating-system users (for execution)
agenda: motivation and overview; resource allocation; frameworks, schedulers, tasks, status updates; high-availability; resource isolation and statistics; security; case studies
I’d love to answer some questions with the help of my data!
I think I’ll try Hadoop.
your datacenter
+ Hadoop
happy?
Not exactly …
… Hadoop is a big hammer, but not everything is a nail!
I’ve got some iterative algorithms, I want to try Spark!
datacenter management
static partitioning
static partitioning considered harmful: (1) hard to share data (2) hard to scale elastically (to exploit statistical multiplexing) (3) hard to fully utilize machines (4) hard to deal with failures
Hadoop = map/reduce + a distributed file system (HDFS)
HDFS
Could we just give Spark its own HDFS cluster too?
HDFS x 2
HDFS x 2: tee incoming data (2 copies), plus a periodic copy/sync between the clusters
That sounds annoying … let’s not do that. Can we do any better though?
HDFS
static partitioning considered harmful: (1) hard to share data (2) hard to scale elastically (to exploit statistical multiplexing) (3) hard to fully utilize machines (4) hard to deal with failures
During the day I’d rather give more machines to Spark but at night I’d rather give more machines to Hadoop!
datacenter management
static partitioning considered harmful: (1) hard to share data (2) hard to scale elastically (to exploit statistical multiplexing) (3) hard to fully utilize machines (4) hard to deal with failures
datacenter management
static partitioning considered harmful: (1) hard to share data (2) hard to scale elastically (to exploit statistical multiplexing) (3) hard to fully utilize machines (4) hard to deal with failures
datacenter management
I don’t want to deal with this!
the datacenter: rather than think about it as racks of individual machines …
… think of the datacenter as a computer
the datacenter computer: applications, resources, a filesystem
mesos is the kernel of the datacenter computer: frameworks are its applications, running atop resources and a filesystem
Step 1: filesystem
Step 2: mesos: run a “master” (or multiple for high availability), and run “slaves” on the rest of the machines
Step 3: frameworks
Step 4: profit
Step 4: profit (statistical multiplexing): reduces CapEx and OpEx, and reduces latency!
Step 4: profit (utilization)
Step 4: profit (failures)
agenda: motivation and overview; resource allocation; frameworks, schedulers, tasks, status updates; high-availability; resource isolation and statistics; security; case studies
resource allocation
reservations: resources can be reserved per slave to provide guarantees; this requires human participation (ops) to determine which roles should have which resources reserved; kind of like thread affinity, but across many machines (and not just for CPUs)
(1) allocate reserved resources to frameworks authorized for a particular role (2) allocate unused reserved resources and unused unreserved resources fairly amongst all frameworks according to their weights
preemption: if a framework runs tasks outside of its reservations, they can be preempted (i.e., the task killed and the resources revoked) in favor of a framework running a task within its reservation
agenda: motivation and overview; resource allocation; frameworks, schedulers, tasks, status updates; high-availability; resource isolation and statistics; security; case studies
framework ≈ distributed system
framework commonality: run processes/tasks simultaneously (distributed), handle process failures (fault-tolerant), optimize performance (elastic); in short: coordinate execution
frameworks are execution coordinators
frameworks are execution schedulers
the end-to-end principle: “application-specific functions ought to reside in the end hosts of a network rather than intermediary nodes”; i.e., frameworks want to coordinate their tasks' execution, and they should be able to
framework anatomy: frameworks talk to mesos through the scheduling API
scheduling
i’d like to run some tasks!
scheduling: “here are some resource offers!”
resource offers: an offer represents a snapshot of the available resources on a particular machine (e.g., foo.bar.com: 4 CPUs, 4 GB RAM) that a framework can use to run tasks; schedulers pick which resources to use to run their tasks
“two-level scheduling”: mesos controls resource allocations to schedulers; schedulers make decisions about what to run given allocated resources
concurrency control: the same resources may be offered to different frameworks; the design space runs from pessimistic (no overlapping offers) to optimistic (all overlapping offers)
tasks: the “threads” of the framework and a consumer of resources (CPU, memory, etc.); either a concrete command line or an opaque description (which requires an executor)
tasks: “here are some resources!” … “launch these tasks!”
status updates: “task status update!”
more scheduling
i’d like to run some tasks!
agenda: motivation and overview; resource allocation; frameworks, schedulers, tasks, status updates; high-availability; resource isolation and statistics; security; case studies
high-availability
high-availability (master): “i'd like to run some tasks!” and “task status update!” both keep working across master failover
high-availability (framework)
high-availability (slave)
agenda: motivation and overview; resource allocation; frameworks, schedulers, tasks, status updates; high-availability; resource isolation and statistics; security; case studies
resource isolation: leverage Linux control groups (cgroups): CPU (upper and lower bounds), memory, network I/O (traffic controller, in progress), filesystem (LVM, in progress)
resource statistics: rarely does allocation == usage (humans are bad at estimating the amount of resources they're using); per-task/executor statistics are collected (for all fork/exec'ed processes too!) and can help with capacity planning
agenda: motivation and overview; resource allocation; frameworks, schedulers, tasks, status updates; high-availability; resource isolation and statistics; security; case studies
security: Twitter recently added SASL support; the default mechanism is CRAM-MD5, with Kerberos support coming in the short term
agenda: motivation and overview; resource allocation; frameworks, schedulers, tasks, status updates; high-availability; resource isolation and statistics; security; case studies
framework commonality: run processes/tasks simultaneously (distributed), handle process failures (fault-tolerant), optimize performance (elastic)
as a “kernel”, mesos provides primitives that make writing a new framework easier (launching tasks, failure detection, etc.); why re-implement them each time!?
case study: Chronos, a distributed cron with dependencies developed at airbnb; ~3k lines of Scala! distributed, highly available, and fault-tolerant without any network programming!
analytics
analytics + services
case study: Aurora, “run 200 of these, somewhere, forever”; developed at Twitter; highly available (uses the Mesos replicated log); uses a Python DSL to describe services; leverages service discovery and proxying (see Twitter commons)
port a framework to mesos: write a “wrapper” scheduler; ~100 lines of code (the more lines, the more you can take advantage of elasticity and other mesos features); see github.com/mesos/hadoop
conclusions: datacenter management is a pain
conclusions: mesos makes running frameworks on your datacenter easier, as well as increasing utilization and performance while reducing CapEx and OpEx!
conclusions: rather than build your next distributed system from scratch, consider using mesos
conclusions: you can share your datacenter between analytics and online services!
Questions?
aurora
framework commonality: run processes simultaneously (distributed), handle process failures (fault-tolerance), optimize execution (elasticity, scheduling)
primitives: scheduler: the distributed system's “master” or “coordinator”; executor: lower-level control of task execution (optional); requests/offers: resource allocations; tasks: the “threads” of the distributed system; …
scheduler: e.g., Apache Hadoop, Chronos
the scheduler (1) brokers for resources, (2) launches tasks, and (3) handles task termination
brokering for resources: (1) make resource requests, e.g., request { 2 CPUs, 1 GB RAM, slave: * } (2) respond to resource offers, e.g., offer { 4 CPUs, 4 GB RAM, slave: foo.bar.com }
offers: non-blocking resource allocation; they exist to answer the question “what should mesos do if it can't satisfy a request?” (1) wait until it can (2) offer the best allocation it can immediately
resource allocation: schedulers (e.g., Apache Hadoop, Chronos) send requests to the allocator, which applies dominant resource fairness and resource reservations and answers with offers; concurrency control spans pessimistic (no overlapping offers) to optimistic (all overlapping offers)
“two-level scheduling”: mesos controls resource allocations to framework schedulers; schedulers make decisions about what to run given allocated resources
the end-to-end principle: “application-specific functions ought to reside in the end hosts of a network rather than intermediary nodes”
tasks: either a concrete command line or an opaque description (which requires a framework executor to execute); a consumer of resources
task operations: launching/killing; health monitoring/reporting (failure detection); resource usage monitoring (statistics)
resource isolation: a cgroup per executor, or per task (if no executor); resource controls are adjusted dynamically as tasks come and go!
case study: chronos, a distributed cron with dependencies built at airbnb
before chronos
a single point of failure (and AWS was unreliable); resource starved (not scalable)
chronos requirements: fault tolerance; distributed (elastically take advantage of resources); retries (make sure a command eventually finishes); dependencies
chronos leverages the primitives of mesos: ~3k lines of scala; highly available (uses Mesos state); distributed/elastic; no actual network programming!
after chronos
after chronos + hadoop
case study: aurora, “run 200 of these, somewhere, forever”, built at Twitter
before aurora: static partitioning of machines to services; hardware outages caused site outages; puppet + monit; ops couldn't scale as fast as engineers
aurora: highly available (uses the mesos replicated log); uses a python DSL to describe services; leverages service discovery and proxying (see Twitter commons)
after aurora: power loss to 19 racks, no lost services! more than 400 engineers running services; the largest cluster has >2500 machines
[diagram: Mesos nodes running Hadoop, Spark, MPI, Storm, and Chronos side by side]