Benjamin Hindman Apache Mesos Design Decisions

Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos

this is not a talk about YARN

at least not explicitly!

this talk is about Mesos!

a little history: Mesos started as a research project at Berkeley in early 2009 by Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica

our motivation increase performance and utilization of clusters

our intuition ①static partitioning considered harmful

static partitioning considered harmful datacenter

static partitioning considered harmful

faster!

higher utilization! static partitioning considered harmful

our intuition ②build new frameworks

“Map/Reduce is a big hammer, but not everything is a nail!”

Apache Mesos is a distributed system for running and building other distributed systems

Mesos is a cluster manager

Mesos is a resource manager

Mesos is a resource negotiator

Mesos replaces static partitioning of resources to frameworks with dynamic resource allocation

Mesos is a distributed system with a master/slave architecture masters slaves

frameworks register with the Mesos master in order to run jobs/tasks masters slaves frameworks

frameworks can be required to authenticate as a principal: SASL with the CRAM-MD5 secret mechanism (Kerberos in development); masters are initialized with secrets

Mesos @Twitter in early 2010 goal: run long-running services elastically on Mesos

Apache Aurora (incubating): Aurora is a Mesos framework that makes it easy to launch services written in Ruby, Java, Scala, Python, Go, etc.!

masters Storm, Jenkins, …

a lot of interesting design decisions along the way

many appear (IMHO) in YARN too

design decisions: ① two-level scheduling and resource offers ② fair-sharing and revocable resources ③ high-availability and fault-tolerance ④ execution and isolation ⑤ C++

frameworks get allocated resources from the masters: resources are allocated via resource offers; a resource offer represents a snapshot of available resources (one offer per host) that a framework can use to run tasks (example offer: hostname, 4 CPUs, 4 GB RAM)

frameworks use these resources to decide what tasks to run: a task can use a subset of an offer (example task: 3 CPUs, 2 GB RAM)
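A minimal sketch of the idea that a task uses a subset of an offer. The `Resources`, `Offer`, and `place_tasks` names are hypothetical and are not the Mesos API; the greedy first-fit choice is only one possible framework policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpus: int
    mem_gb: int

    def fits_in(self, other: "Resources") -> bool:
        return self.cpus <= other.cpus and self.mem_gb <= other.mem_gb

    def minus(self, other: "Resources") -> "Resources":
        return Resources(self.cpus - other.cpus, self.mem_gb - other.mem_gb)


@dataclass(frozen=True)
class Offer:
    hostname: str
    resources: Resources


def place_tasks(offer, pending):
    """Greedily pick pending tasks that fit in what is left of one offer."""
    remaining = offer.resources
    launched = []
    for name, needs in pending:
        if needs.fits_in(remaining):
            launched.append(name)
            remaining = remaining.minus(needs)
    return launched, remaining


# The offer and the first task mirror the numbers on the slides.
offer = Offer("hostname", Resources(cpus=4, mem_gb=4))
pending = [
    ("task-a", Resources(3, 2)),
    ("task-b", Resources(2, 1)),  # does not fit once task-a is placed
    ("task-c", Resources(1, 1)),
]
launched, leftover = place_tasks(offer, pending)
print(launched, leftover)
```

Whatever the framework leaves unused (here 1 GB RAM) simply goes back to the pool of resources that can be offered again.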

Mesos challenged the status quo of cluster managers

cluster manager status quo: the application sends a specification to the cluster manager; the specification includes as much information as possible to assist the cluster manager in scheduling and execution

cluster manager status quo: the application waits for the task to be executed

cluster manager status quo: the cluster manager returns the result to the application

problems with specifications: ① hard to specify certain desires or constraints ② hard to update specifications dynamically as tasks execute and finish or fail

an alternative model: the framework sends the masters a request (3 CPUs, 2 GB RAM); a request is a purposely simplified subset of a specification, mainly including the required resources

question: what should Mesos do if it can’t satisfy a request?

① wait until it can …

question: what should Mesos do if it can’t satisfy a request? ① wait until it can … ② offer the best it can immediately

an alternative model: the masters send the framework an offer (hostname, 4 CPUs, 4 GB RAM)

an alternative model: the masters send the framework several offers (each: hostname, 4 CPUs, 4 GB RAM)

an alternative model: the masters send the framework several offers (each: hostname, 4 CPUs, 4 GB RAM); the framework uses the offers to perform its own scheduling

an analogue, non-blocking sockets: the application asks the kernel to write(s, buffer, size);

an analogue, non-blocking sockets: the kernel tells the application that 42 of 100 bytes were written!
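The socket analogy can be tried directly. This is a small sketch using Python's standard `socket` module; the 16 MiB payload size is an arbitrary assumption chosen to exceed typical kernel buffers, and the exact number of bytes accepted depends on the operating system.

```python
import socket

# A connected pair of local sockets stands in for the application and its peer.
left, right = socket.socketpair()
left.setblocking(False)

payload = b"x" * (16 * 1024 * 1024)  # assumed to be larger than the kernel buffer

# A non-blocking send returns immediately with however many bytes the kernel
# could accept right now, which may be fewer than requested.
written = left.send(payload)
print(f"{written} of {len(payload)} bytes written")

# The caller decides what to do with the rest (retry later, drop it, ...),
# much like a framework deciding what to do with a partial offer.
remainder = payload[written:]

left.close()
right.close()
```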

resource offers address asynchrony in resource allocation

IIUC, even YARN allocates “the best it can” to an application when it can’t satisfy a request

requests are complementary (but not necessary)

offers represent the currently available resources a framework can use

question: should resources within offers be disjoint?

masters send an offer to framework1 and an offer to framework2 (each: hostname, 4 CPUs, 4 GB RAM)

concurrency control: optimistic vs. pessimistic

concurrency control, optimistic: all offers overlap with one another, thus causing frameworks to “compete” first-come-first-served

concurrency control, pessimistic: offers made to different frameworks are disjoint

Mesos semantics: assume overlapping offers

design comparison: Google’s Omega

the Omega model: a framework gets a snapshot of the cluster state from a database (note, does not make a request!)

the Omega model: a framework submits a transaction to the database to “acquire” resources (which it can then use to run tasks); failed transactions occur when another framework has already acquired the sought resources
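A toy sketch of the snapshot-then-transaction pattern described for Omega. The `CellState` class and its per-host version counter are assumptions made for illustration; they are not taken from Omega itself.

```python
class CellState:
    """Shared cluster state: free CPUs per host plus a per-host version."""

    def __init__(self, free_cpus):
        self.free_cpus = dict(free_cpus)
        self.version = {host: 0 for host in free_cpus}

    def snapshot(self):
        # Frameworks read freely; nothing is locked or set aside for them.
        return dict(self.free_cpus), dict(self.version)

    def commit(self, host, cpus, seen_version):
        # The transaction fails if the host changed since the snapshot
        # or if the resources are no longer available.
        if self.version[host] != seen_version or self.free_cpus[host] < cpus:
            return False
        self.free_cpus[host] -= cpus
        self.version[host] += 1
        return True


state = CellState({"host1": 4})

# Two frameworks take snapshots of the same state ...
free_a, versions_a = state.snapshot()
free_b, versions_b = state.snapshot()

# ... and both try to acquire 3 CPUs on host1.
first = state.commit("host1", 3, versions_a["host1"])
second = state.commit("host1", 3, versions_b["host1"])
print(first, second, state.free_cpus)
```

The second transaction fails because the first one already acquired the sought resources, which is the same first-come-first-served outcome as two overlapping offers.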

isomorphism?

observation: snapshots are optimistic offers

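Omega and Mesos: a snapshot from the database to a framework (Omega) alongside an offer from the masters to a framework (Mesos): hostname, 4 CPUs, 4 GB RAM

Omega and Mesos: a transaction from a framework to the database (Omega) alongside a task from a framework to the masters (Mesos): 3 CPUs, 2 GB RAM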

thought experiment: what’s gained by exploiting the continuous spectrum of pessimistic to optimistic?

Mesos allocates resources to frameworks using a fair-sharing algorithm we created called Dominant Resource Fairness (DRF)

DRF, born of static partitioning datacenter

static partitioning across teams: promotions, trends, recommendations

static partitioning across teams (promotions, trends, recommendations): fairly shared!

goal: fairly share the resources without static partitioning

partition utilizations by team: promotions 45% CPU, 100% RAM; trends 75% CPU, 100% RAM; recommendations 100% CPU, 50% RAM

observation: a dominant resource bottlenecks each team from running any more jobs/tasks

dominant resource bottlenecks by team: promotions 45% CPU, 100% RAM (bottleneck: RAM); trends 75% CPU, 100% RAM (bottleneck: RAM); recommendations 100% CPU, 50% RAM (bottleneck: CPU)

insight: allocating a fair share of each team’s dominant resource guarantees they can run at least as many jobs/tasks as with static partitioning!

… if my team gets at least 1/N of my dominant resource I will do no worse than if I had my own cluster, but I might do better when resources are available!

DRF in Mesos: ① frameworks specify a role when they register (i.e., the team to charge for the resources)

DRF in Mesos: ① frameworks specify a role when they register (i.e., the team to charge for the resources) ② master calculates each role’s dominant resource (dynamically) and allocates appropriately
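A small sketch of a DRF-style allocation loop: repeatedly give one task's worth of resources to the role with the lowest dominant share. The cluster size and per-task demands below are illustrative assumptions, not numbers from the slides, and the loop is a simplification rather than the Mesos allocator.

```python
from fractions import Fraction


def dominant_share(allocated, total):
    """Largest fraction of any single resource that a role holds."""
    return max(Fraction(allocated[r], total[r]) for r in total)


def drf_allocate(total, demands):
    """Hand out one task at a time to the role with the lowest dominant share."""
    used = {r: 0 for r in total}
    allocated = {role: {r: 0 for r in total} for role in demands}
    tasks = {role: 0 for role in demands}

    def fits(role):
        return all(used[r] + demands[role][r] <= total[r] for r in total)

    while True:
        candidates = [role for role in demands if fits(role)]
        if not candidates:
            return tasks
        role = min(candidates, key=lambda c: dominant_share(allocated[c], total))
        for r in total:
            used[r] += demands[role][r]
            allocated[role][r] += demands[role][r]
        tasks[role] += 1


# Assumed example: 9 CPUs and 18 GB RAM shared by two roles.
total = {"cpus": 9, "mem_gb": 18}
demands = {
    "trends": {"cpus": 1, "mem_gb": 4},      # memory-heavy tasks
    "promotions": {"cpus": 3, "mem_gb": 1},  # CPU-heavy tasks
}
result = drf_allocate(total, demands)
print(result)
```

With these numbers both roles end at a dominant share of 2/3 (trends holds 12 of 18 GB, promotions holds 6 of 9 CPUs), which is more than the 1/2 each would get from a static split.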

Step 4: Profit (statistical multiplexing) $

in practice, fair sharing is insufficient

weighted fair sharing across teams: promotions, trends, recommendations

weighted fair sharing, weight per team: promotions 0.17, trends 0.5, recommendations 0.33

Mesos implements weighted DRF: masters can be configured with weights per role; resource allocation decisions incorporate the weights to determine dominant fair shares
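One common way to incorporate weights, sketched here as an assumption rather than as the Mesos implementation, is to divide each role's dominant share by its weight and serve the role with the lowest result. The weights are the ones on the slide (the team-to-weight mapping is assumed from the listed order); the current dominant shares are hypothetical.

```python
from fractions import Fraction

# Weights as shown on the slide; the team-to-weight mapping is assumed.
weights = {
    "promotions": Fraction(17, 100),
    "trends": Fraction(50, 100),
    "recommendations": Fraction(33, 100),
}

# Hypothetical current dominant shares (fraction of the cluster's CPU or RAM).
dominant_shares = {
    "promotions": Fraction(10, 100),
    "trends": Fraction(20, 100),
    "recommendations": Fraction(30, 100),
}


def weighted_share(role):
    """Dominant share scaled by the role's weight."""
    return dominant_shares[role] / weights[role]


def next_role_to_serve():
    """The role furthest below its weighted fair share gets the next allocation."""
    return min(dominant_shares, key=weighted_share)


print(next_role_to_serve())
```

Here promotions holds the smallest raw share, yet trends is served next because its larger weight entitles it to more of the cluster.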

in practice, weighted fair sharing is still insufficient

a non-cooperative framework (e.g., one that has long tasks or is buggy) can get allocated too many resources

Mesos provides reservations: slaves can be configured with resource reservations for particular roles (dynamic, time-based, and percentage-based reservations are in development); resource offers include the reservation role (if any); example offer from the masters to a framework (trends): hostname, 4 CPUs, 4 GB RAM, role: trends

reservations: reservations provide guarantees, but at the cost of utilization

revocable resources: reserved resources that are unused can be allocated to frameworks from different roles, but those resources may be revoked at any time; example offer from the masters to a framework (promotions): hostname, 4 CPUs, 4 GB RAM, role: trends
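A sketch of how an offer might carry both the reservation role and a revocable flag. The `OfferedResources` type, `make_offer`, and `revoke` are hypothetical names for illustration and not the Mesos API; the host names are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OfferedResources:
    hostname: str
    cpus: int
    reserved_for: str  # role the resources are reserved for
    revocable: bool    # True when lent to a role that does not own them


def make_offer(hostname, unused_cpus, reserved_for, framework_role):
    """Offer unused reserved CPUs; mark them revocable for other roles."""
    return OfferedResources(
        hostname=hostname,
        cpus=unused_cpus,
        reserved_for=reserved_for,
        revocable=(framework_role != reserved_for),
    )


def revoke(running_tasks, reclaimed_role):
    """Return the tasks to kill when a role reclaims its reservation."""
    return [
        task
        for task, offer in running_tasks
        if offer.revocable and offer.reserved_for == reclaimed_role
    ]


to_trends = make_offer("host-a", 4, "trends", framework_role="trends")
to_promotions = make_offer("host-b", 4, "trends", framework_role="promotions")

running = [("trends-task", to_trends), ("promotions-task", to_promotions)]
killed = revoke(running, reclaimed_role="trends")
print(killed)
```

Only the task running on borrowed resources is at risk, which matches the promise that tasks are not killed unless they use revocable resources.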

preemption via revocation … my tasks will not be killed unless I’m using revocable resources!

high-availability and fault-tolerance: a prerequisite @twitter; ① framework failover ② master failover ③ slave failover; causes: machine failure, process failure (bugs!), upgrades

① framework failover: the framework re-registers with the master and resumes operation; all tasks keep running across framework failover!

② master failover: after a new master is elected, all frameworks and slaves connect to the new master; all tasks keep running across master failover!

slave ③slave failover mesos-slave task

slave ③slave failover task

slave ③slave failover mesos-slave task

slave ③slave failover @twitter mesos-slave (large in-memory services, expensive to restart)

execution: frameworks launch fine-grained tasks for execution (example task: 3 CPUs, 2 GB RAM); if necessary, a framework can provide an executor to handle the execution of a task

slave executor mesos-slave executor task

goal: isolation

slave isolation mesos-slave executor task

slave isolation mesos-slave executor task containers

executor + task design means containers can have changing resource allocations

slave isolation mesos-slave executor task

making the task first-class gives us true fine-grained resource sharing

requirement: fast task launching (i.e., milliseconds or less)

virtual machines an anti-pattern

operating-system virtualization: containers (zones and projects), control groups (cgroups), namespaces

isolation support: tight integration with cgroups; CPU (upper and lower bounds), memory, network I/O (traffic controller, in development), filesystem (using LVM, in development)

statistics too: rarely does allocation == usage (humans are bad at estimating the amount of resources they’re using); used @twitter for capacity planning (and oversubscription, in development)

CPU upper bounds? in practice, determinism trumps utilization

requirements: ① performance ② maintainability (static typing) ③ interfaces to low-level OS (for isolation, etc.) ④ interoperability with other languages (for library bindings)

garbage collection: a performance anti-pattern

consequences: ① antiquated libraries (especially around concurrency and networking) ② nascent community

github.com/3rdparty/libprocess concurrency via futures/actors, networking via message passing

github.com/3rdparty/stout monads in C++, safe and understandable utilities

but …

scalability simulations to 50,000+ slaves

@twitter we run multiple Mesos clusters each with 3500+ nodes

final remarks

frameworks: Hadoop (github.com/mesos/hadoop), Spark (github.com/mesos/spark), DPark (github.com/douban/dpark), Storm (github.com/nathanmarz/storm), Chronos (github.com/airbnb/chronos), MPICH2 (in mesos git repository), Marathon (github.com/mesosphere/marathon), Aurora (github.com/twitter/aurora)

write your next distributed system with Mesos!

port a framework to Mesos: write a “wrapper”; ~100 lines of code to write a wrapper (the more lines, the more you can take advantage of elasticity or other Mesos features); see http://github.com/mesos/hadoop

Thank You! mesos.apache.org mesos.apache.org/blog @ApacheMesos

② master failover: after a new master is elected, all frameworks and slaves connect to the new master; all tasks keep running across master failover!

stateless master: to make master failover fast, we choose to make the master stateless; state is stored in the leaves, at the frameworks and the slaves; makes sense for frameworks that don’t want to store state (i.e., can’t actually fail over); consequences: slaves are fairly complicated (need to checkpoint), and frameworks need to save their own state and reconcile (we built some tools to help, including a replicated log)
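A toy sketch of state living in the leaves: a freshly elected master starts empty and rebuilds its view from slave re-registrations. The `Slave` and `Master` classes and their method names are hypothetical and only illustrate the idea.

```python
class Slave:
    def __init__(self, hostname, running_tasks):
        self.hostname = hostname
        # In this sketch the slave is the source of truth for its own tasks.
        self.running_tasks = dict(running_tasks)  # task id -> framework id

    def reregister(self, master):
        master.on_slave_reregistered(self.hostname, self.running_tasks)


class Master:
    """Keeps only soft state that can be rebuilt from re-registrations."""

    def __init__(self):
        self.tasks_by_host = {}

    def on_slave_reregistered(self, hostname, running_tasks):
        self.tasks_by_host[hostname] = dict(running_tasks)

    def tasks_of(self, framework_id):
        return sorted(
            task
            for tasks in self.tasks_by_host.values()
            for task, owner in tasks.items()
            if owner == framework_id
        )


slaves = [
    Slave("host1", {"t1": "aurora"}),
    Slave("host2", {"t2": "aurora", "t3": "chronos"}),
]

new_master = Master()  # freshly elected, knows nothing yet
for slave in slaves:
    slave.reregister(new_master)

print(new_master.tasks_of("aurora"))
```

The tasks themselves are never touched during the handover, which is why they keep running across master failover.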


Apache Mesos is a distributed system for running and building other distributed systems

origins Berkeley research project including Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica mesos.apache.org/documentation

ecosystem mesos developers operators framework developers

a tour of mesos from different perspectives of the ecosystem

the operator

People who run and manage frameworks (Hadoop, Storm, MPI, Spark, Memcache, etc.). Tools: virtual machines, Chef, Puppet (emerging: PaaS, Docker). “ops” at most companies (SREs at Twitter). The static partitioners.

for the operator, Mesos is a cluster manager

for the operator, Mesos is a resource manager

for the operator, Mesos is a resource negotiator

for the operator, Mesos replaces static partitioning of resources to frameworks with dynamic resource allocation

for the operator, Mesos is a distributed system with a master/slave architecture masters slaves

frameworks/applications register with the Mesos master in order to run jobs/tasks masters slaves

frameworks can be required to authenticate as a principal*: SASL with the CRAM-MD5 secret mechanism (Kerberos in development); masters are initialized with secrets

Mesos is highly-available and fault-tolerant

the framework developer

Mesos uses Apache ZooKeeper for coordination masters slaves Apache ZooKeeper

increase utilization with revocable resources and preemption: masters offer resources (hostname: 4 CPUs, 4 GB RAM, role: -) to framework1, framework2, framework3

optimistic vs pessimistic what to say here …

authorization*: principals can be used for authorizing allocation roles and for authorizing operating system users (for execution)

authorization

agenda: motivation and overview; resource allocation; frameworks, schedulers, tasks, status updates; high-availability; resource isolation and statistics; security; case studies

I’d love to answer some questions with the help of my data!

I think I’ll try Hadoop.

your datacenter

+ Hadoop

happy?

Not exactly …

… Hadoop is a big hammer, but not everything is a nail!

I’ve got some iterative algorithms, I want to try Spark!

datacenter management

static partitioning

static partitioning considered harmful

(1) hard to share data (2) hard to scale elastically (to exploit statistical multiplexing) (3) hard to fully utilize machines (4) hard to deal with failures

static partitioning considered harmful: (1) hard to share data (2) hard to scale elastically (to exploit statistical multiplexing) (3) hard to fully utilize machines (4) hard to deal with failures

Hadoop … (map/reduce) (distributed file system)

Could we just give Spark its own HDFS cluster too?

HDFS x 2

tee incoming data (2 copies)

HDFS x 2 tee incoming data (2 copies) periodic copy/sync

That sounds annoying … let’s not do that. Can we do any better though?

During the day I’d rather give more machines to Spark but at night I’d rather give more machines to Hadoop!

I don’t want to deal with this!

the datacenter … rather than think about the datacenter like this …

… is a computer think about it like this …

datacenter computer applications resources filesystem

mesos applications resources filesystem kernel

mesos frameworks resources filesystem kernel

Step 1: filesystem

Step 2: mesos run a “master” (or multiple for high availability)

Step 2: mesos run “slaves” on the rest of the machines

Step 3: frameworks

Step 4: profit $

Step 4: profit (statistical multiplexing) $

$ reduces CapEx and OpEx!

Step 4: profit (statistical multiplexing) $ reduces latency!

Step 4: profit (utilize) $

Step 4: profit (failures) $

mesos frameworks resources filesystem kernel

mesos frameworks resources kernel

resource allocation

reservations: can reserve resources per slave to provide guaranteed resources; requires human participation (ops) to determine what roles should be reserved what resources; kind of like thread affinity, but across many machines (and not just for CPUs)

resource allocation

(1) allocate reserved resources to frameworks authorized for a particular role (2) allocate unused reserved resources and unused unreserved resources fairly amongst all frameworks according to their weights

preemption: if a framework runs tasks outside of its reservations, they can be preempted (i.e., the task killed and the resources revoked) in favor of a framework running a task within its reservation

mesos frameworks kernel

framework ≈ distributed system

framework commonality: run processes/tasks simultaneously (distributed); handle process failures (fault-tolerant); optimize performance (elastic)

framework commonality: run processes/tasks simultaneously (distributed); handle process failures (fault-tolerant); optimize performance (elastic); coordinate execution

frameworks are execution coordinators

frameworks are execution schedulers

end-to-end principle: “application-specific functions ought to reside in the end hosts of a network rather than intermediary nodes”, i.e., frameworks want to coordinate their tasks’ execution and they should be able to

framework anatomy frameworks

framework anatomy frameworks scheduling API

scheduling

i’d like to run some tasks!

scheduling here are some resource offers!

resource offers: an offer represents the snapshot of available resources on a particular machine that a framework can use to run tasks; schedulers pick which resources to use to run their tasks (example offer: foo.bar.com, 4 CPUs, 4 GB RAM)

“two-level scheduling”: mesos controls resource allocations to schedulers; schedulers make decisions about what to run given allocated resources

concurrency control: the same resources may be offered to different frameworks

concurrency control: the same resources may be offered to different frameworks; a spectrum from pessimistic (no overlapping offers) to optimistic (all overlapping offers)

tasks: the “threads” of the framework; a consumer of resources (CPU, memory, etc.); either a concrete command line or an opaque description (which requires an executor)

tasks here are some resources!

tasks launch these tasks!

status updates

task status update!

status updates

task status update!

more scheduling

i’d like to run some tasks!

high-availability

high-availability (master)

task status update!

high-availability (master) i’d like to run some tasks!

high-availability (master)

high-availability (framework)

high-availability (slave)

resource isolation: leverage Linux control groups (cgroups) for CPU (upper and lower bounds), memory, network I/O (traffic controller, in progress), filesystem (LVM, in progress)

resource statistics: rarely does allocation == usage (humans are bad at estimating the amount of resources they’re using); per task/executor statistics are collected (for all fork/exec’ed processes too!); can help with capacity planning

security: Twitter recently added SASL support; the default mechanism is CRAM-MD5; Kerberos support is planned for the short term

framework commonality run processes/tasks simultaneously (distributed) handle process failures (fault-tolerant) optimize performance (elastic)

framework commonality: as a “kernel”, mesos provides a lot of primitives that make writing a new framework easier, such as launching tasks, doing failure detection, etc.; why re-implement them each time!?

case study: chronos: distributed cron with dependencies; developed at airbnb; ~3k lines of Scala! distributed, highly available, and fault tolerant without any network programming! http://github.com/airbnb/chronos

analytics

analytics + services

case study: aurora: “run 200 of these, somewhere, forever”; developed at Twitter; highly available (uses the mesos replicated log); uses a python DSL to describe services; leverages service discovery and proxying (see Twitter commons); http://github.com/twitter/aurora


write your next distributed system with mesos!

port a framework to mesos: write a “wrapper” scheduler; ~100 lines of code to write a wrapper (the more lines, the more you can take advantage of elasticity or other mesos features); see http://github.com/mesos/hadoop

conclusions datacenter management is a pain

conclusions mesos makes running frameworks on your datacenter easier as well as increasing utilization and performance while reducing CapEx and OpEx!

conclusions rather than build your next distributed system from scratch, consider using mesos

conclusions you can share your datacenter between analytics and online services!

Questions? mesos.apache.org @ApacheMesos

aurora

framework commonality: run processes simultaneously (distributed); handle process failures (fault-tolerance); optimize execution (elasticity, scheduling)

primitives: scheduler – distributed system “master” or “coordinator”; (executor – lower-level control of task execution, optional); requests/offers – resource allocations; tasks – “threads” of the distributed system; …

scheduler Apache Hadoop Chronos

scheduler: (1) brokers for resources (2) launches tasks (3) handles task termination

brokering for resources: (1) make resource requests (e.g., 2 CPUs, 1 GB RAM, slave *) (2) respond to resource offers (e.g., 4 CPUs, 4 GB RAM, slave foo.bar.com)
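The three scheduler duties can be sketched as three callbacks. `ToyScheduler` and its method names are hypothetical, loosely shaped like a scheduler interface but not the actual Mesos API; the request and offer values mirror the slide.

```python
class ToyScheduler:
    """Hypothetical scheduler: brokers resources, launches tasks, handles termination."""

    def __init__(self, pending):
        self.pending = list(pending)  # (task id, cpus, mem_gb)
        self.running = {}             # task id -> hostname
        self.finished = []

    def resource_request(self):
        # (1) Optionally tell the master what the next task needs, on any slave.
        if not self.pending:
            return None
        _, cpus, mem_gb = self.pending[0]
        return {"cpus": cpus, "mem_gb": mem_gb, "slave": "*"}

    def resource_offer(self, hostname, cpus, mem_gb):
        # (1) + (2) Respond to an offer by choosing which tasks to launch on it.
        launched, still_pending = [], []
        for task_id, need_cpus, need_mem in self.pending:
            if need_cpus <= cpus and need_mem <= mem_gb:
                cpus -= need_cpus
                mem_gb -= need_mem
                self.running[task_id] = hostname
                launched.append(task_id)
            else:
                still_pending.append((task_id, need_cpus, need_mem))
        self.pending = still_pending
        return launched

    def status_update(self, task_id, state):
        # (3) Handle task termination.
        if state in ("FINISHED", "FAILED"):
            self.running.pop(task_id, None)
            self.finished.append((task_id, state))


scheduler = ToyScheduler([("t1", 2, 1), ("t2", 2, 1), ("t3", 2, 1)])
request = scheduler.resource_request()
launched = scheduler.resource_offer("foo.bar.com", cpus=4, mem_gb=4)
scheduler.status_update("t1", "FINISHED")
print(request, launched, scheduler.running, scheduler.pending)
```

The third task stays pending until another offer arrives, so the scheduler never blocks waiting for its full request to be satisfied.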

offers: non-blocking resource allocation; they exist to answer the question: “what should mesos do if it can’t satisfy a request?” (1) wait until it can (2) offer the best allocation it can immediately

resource allocation Apache Hadoop Chronos request

resource allocation Apache Hadoop Chronos request allocator dominant resource fairness resource reservations

resource allocation: Apache Hadoop, Chronos; request; allocator: dominant resource fairness, resource reservations; optimistic vs. pessimistic

resource allocation: Apache Hadoop, Chronos; request; allocator: dominant resource fairness, resource reservations; a spectrum from pessimistic (no overlapping offers) to optimistic (all overlapping offers)

resource allocation Apache Hadoop Chronos offer allocator dominant resource fairness resource reservations

“two-level scheduling” mesos: controls resource allocations to framework schedulers schedulers: make decisions about what to run given allocated resources

end-to-end principle “application-specific functions ought to reside in the end hosts of a network rather than intermediary nodes”

tasks: either a concrete command line or an opaque description (which requires a framework executor to execute); a consumer of resources
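A sketch of the two task shapes: a command line that can be run as-is, or opaque bytes that only a framework-supplied executor understands. The `Task` type and `execute` function are hypothetical; the shell command and the upper-casing executor are stand-ins.

```python
import subprocess
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    task_id: str
    command: Optional[str] = None  # concrete command line
    data: Optional[bytes] = None   # opaque description, meaningful only to an executor


def execute(task, executor=None):
    """Run a command task directly, or hand an opaque task to its executor."""
    if task.command is not None:
        completed = subprocess.run(
            task.command, shell=True, capture_output=True, text=True
        )
        return completed.stdout.strip()
    if executor is None:
        raise ValueError("an opaque task requires an executor")
    return executor(task.data)


def shouting_executor(data):
    # Stand-in for framework-specific logic that interprets the opaque bytes.
    return data.decode().upper()


from_command = execute(Task("t1", command="echo hello"))
from_executor = execute(Task("t2", data=b"opaque payload"), executor=shouting_executor)
print(from_command, from_executor)
```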

task operations: launching/killing; health monitoring/reporting (failure detection); resource usage monitoring (statistics)

resource isolation: cgroup per executor or task (if no executor); resource controls adjusted dynamically as tasks come and go!
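A sketch of limits that follow the tasks: one container per executor whose limits are recomputed whenever a task arrives or leaves. The `Container` class is a stand-in that only does the bookkeeping; it does not touch real cgroups, and the task sizes are made up.

```python
class Container:
    """Stand-in for one cgroup shared by an executor and its tasks."""

    def __init__(self):
        self.tasks = {}  # task id -> (cpus, mem_mb)

    def limits(self):
        # The container's limits are the sum of what its tasks were allocated.
        return {
            "cpus": sum(cpus for cpus, _ in self.tasks.values()),
            "mem_mb": sum(mem for _, mem in self.tasks.values()),
        }

    def add_task(self, task_id, cpus, mem_mb):
        self.tasks[task_id] = (cpus, mem_mb)
        return self.limits()

    def remove_task(self, task_id):
        del self.tasks[task_id]
        return self.limits()


container = Container()
after_first = container.add_task("t1", cpus=1, mem_mb=512)
after_second = container.add_task("t2", cpus=2, mem_mb=1024)
after_removal = container.remove_task("t1")
print(after_first, after_second, after_removal)
```

In a real system each recomputation would be followed by writing the new values to the container's resource controls.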

case study: chronos distributed cron with dependencies built at airbnb by @flo

before chronos

single point of failure (and AWS was unreliable); resource starved (not scalable)

chronos requirements: fault tolerance; distributed (elastically take advantage of resources); retries (make sure a command eventually finishes); dependencies

chronos leverages the primitives of mesos: ~3k lines of Scala; highly available (uses Mesos state); distributed / elastic; no actual network programming!

after chronos

after chronos + hadoop

case study: aurora “run 200 of these, somewhere, forever” built at Twitter

before aurora: static partitioning of machines to services; hardware outages caused site outages; puppet + monit; ops couldn’t scale as fast as engineers

aurora: highly available (uses mesos replicated log); uses a python DSL to describe services; leverages service discovery and proxying (see Twitter commons)

after aurora: power loss to 19 racks, no lost services! more than 400 engineers running services; largest cluster has >2500 machines

Mesos: nodes running Hadoop, Spark, MPI, Storm, Chronos, …

Step 4: Profit (statistical multiplexing) $
