Apache Mesos Design Decisions
Benjamin Hindman
this is not a talk about YARN
at least not explicitly!
this talk is about Mesos!
a little history: Mesos started as a research project at Berkeley in early 2009, by Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica
our motivation: increase the performance and utilization of clusters
our intuition ①static partitioning considered harmful
static partitioning considered harmful: sharing the datacenter dynamically is faster and achieves higher utilization!
our intuition ②build new frameworks
“Map/Reduce is a big hammer, but not everything is a nail!”
Apache Mesos is a distributed system for running and building other distributed systems
Mesos is a cluster manager
Mesos is a resource manager
Mesos is a resource negotiator
Mesos replaces static partitioning of resources to frameworks with dynamic resource allocation
Mesos is a distributed system with a master/slave architecture
frameworks register with the Mesos master in order to run jobs/tasks
frameworks can be required to authenticate as a principal: masters are initialized with secrets and use SASL with the CRAM-MD5 mechanism (Kerberos support in development)
in early 2010, a new goal: run long-running services elastically on Mesos
Apache Aurora (incubating) is a Mesos framework that makes it easy to launch services written in Ruby, Java, Scala, Python, Go, etc.!
Storm, Jenkins, …
a lot of interesting design decisions along the way
many appear (IMHO) in YARN too
design decisions ①two-level scheduling and resource offers ②fair-sharing and revocable resources ③high-availability and fault-tolerance ④execution and isolation ⑤C++
frameworks get allocated resources from the masters. resources are allocated via resource offers: a resource offer represents a snapshot of available resources on one host that a framework can use to run tasks, e.g., offer { hostname, 4 CPUs, 4 GB RAM }
frameworks use these resources to decide what tasks to run; a task can use a subset of an offer, e.g., task { 3 CPUs, 2 GB RAM }
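A minimal sketch of this offer/task model (illustrative types only, not the actual Mesos API):

```cpp
// Illustrative model of resource offers: hypothetical names, not Mesos code.
#include <iostream>
#include <string>
#include <vector>

struct Offer {       // snapshot of available resources on one host
  std::string hostname;
  double cpus;
  double memGB;
};

struct Task {        // a task may claim any subset of an offer
  std::string name;
  double cpus;
  double memGB;
};

// A framework scheduler decides what to run with the offered resources.
std::vector<Task> schedule(const Offer& offer) {
  std::vector<Task> tasks;
  if (offer.cpus >= 3 && offer.memGB >= 2) {
    tasks.push_back({"my-task", 3, 2});   // uses a subset of the offer
  }
  return tasks;                           // unused resources return to the master
}

int main() {
  Offer offer{"host1.example.com", 4, 4};  // "here are 4 CPUs, 4 GB RAM"
  for (const Task& t : schedule(offer)) {
    std::cout << "launch " << t.name << " on " << offer.hostname << "\n";
  }
}
```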
Mesos challenged the status quo of cluster managers
cluster manager status quo: the application hands the cluster manager a specification that includes as much information as possible to assist in scheduling and execution, waits for its tasks to be executed, and eventually gets back a result
problems with specifications: ① hard to express certain desires or constraints ② hard to update specifications dynamically as tasks execute and finish/fail
an alternative model: the framework sends the masters a request, e.g., request { 3 CPUs, 2 GB RAM }; a request is a purposely simplified subset of a specification, mainly just the required resources
question: what should Mesos do if it can’t satisfy a request?
① wait until it can …
② offer the best it can immediately
an alternative model: instead, the masters send the framework resource offers, e.g., offer { hostname, 4 CPUs, 4 GB RAM }; the framework uses the offers to perform its own scheduling
an analogue: non-blocking sockets. the application asks the kernel to write(s, buffer, size); … and the kernel replies “42 of 100 bytes written!”
resource offers address asynchrony in resource allocation
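To make the analogue concrete, here is a minimal, self-contained C++/POSIX sketch (using a pipe instead of a socket purely for brevity): a non-blocking write returns however much the kernel can accept right now rather than blocking until the whole request fits.

```cpp
// The non-blocking analogue: ask for everything, accept a partial result.
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main() {
  int fds[2];
  if (pipe(fds) != 0) return 1;
  fcntl(fds[1], F_SETFL, O_NONBLOCK);   // make the write end non-blocking

  std::vector<char> buffer(100000);     // ask to write more than the pipe holds
  ssize_t n = write(fds[1], buffer.data(), buffer.size());

  // Like a resource offer, the kernel immediately gives what it can right now
  // (typically "65536 of 100000 bytes written" on Linux) instead of blocking
  // until the full request can be satisfied.
  printf("%zd of %zu bytes written\n", n, buffer.size());

  close(fds[0]);
  close(fds[1]);
}
```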
IIUC, even YARN allocates “the best it can” to an application when it can’t satisfy a request
requests are complementary (but not necessary)
offers represent the currently available resources a framework can use
question: should resources within offers be disjoint?
the masters can send framework1 and framework2 the same offer { hostname, 4 CPUs, 4 GB RAM }
concurrency control: optimistic vs. pessimistic
optimistic: all offers overlap with one another, causing frameworks to “compete” first-come-first-served
pessimistic: offers made to different frameworks are disjoint
Mesos semantics: assume overlapping offers
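A toy sketch of the optimistic end of that spectrum: every framework sees an overlapping offer for the same host, and the master validates launches first-come-first-served (hypothetical names, not Mesos code):

```cpp
// Optimistic concurrency control: overlapping offers, validated at launch.
#include <iostream>
#include <map>
#include <string>

std::map<std::string, double> available = {{"host1", 4.0}};  // CPUs per host

// Every framework "sees" the full availability (overlapping offers)...
bool launch(const std::string& framework, const std::string& host, double cpus) {
  if (available[host] >= cpus) {       // ...but the master validates at launch
    available[host] -= cpus;
    std::cout << framework << ": launched (" << cpus << " cpus)\n";
    return true;
  }
  std::cout << framework << ": launch failed, resources already taken\n";
  return false;                        // the loser must wait for new offers
}

int main() {
  launch("framework1", "host1", 3);    // wins the race
  launch("framework2", "host1", 3);    // competed for the same offer; loses
}
```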
design comparison: Google’s Omega
the Omega model: a framework gets a snapshot of the cluster state from a database (note: it does not make a request!)
the Omega model: a framework submits a transaction to the database to “acquire” resources (which it can then use to run tasks); a transaction fails when another framework has already acquired the sought resources
isomorphism?
observation: snapshots are optimistic offers
Omega and Mesos: Omega's snapshot from the database corresponds to a Mesos offer { hostname, 4 CPUs, 4 GB RAM } from the masters
Omega and Mesos: Omega's transaction against the database corresponds to a Mesos framework launching a task { 3 CPUs, 2 GB RAM }
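A sketch of that correspondence: an Omega-style transaction is essentially a compare-and-swap against shared cluster state, reduced here to a single atomic counter of free CPUs (an illustration, not either system's actual code):

```cpp
// The Omega model in miniature: snapshot = optimistic view of shared state,
// acquiring resources = a compare-and-swap transaction against it.
#include <atomic>
#include <iostream>

std::atomic<int> freeCpus{4};  // shared cluster state ("the database")

bool transact(int want) {
  int snapshot = freeCpus.load();              // framework reads a snapshot
  while (snapshot >= want) {
    // Commit succeeds only if no other framework changed the state meanwhile;
    // on failure, 'snapshot' is refreshed and we re-check.
    if (freeCpus.compare_exchange_weak(snapshot, snapshot - want)) return true;
  }
  return false;                                // failed transaction: retry later
}

int main() {
  std::cout << "framework1: " << (transact(3) ? "acquired" : "failed") << "\n";
  std::cout << "framework2: " << (transact(3) ? "acquired" : "failed") << "\n";
}
```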
thought experiment: what's gained by exploiting the continuous spectrum from pessimistic to optimistic?
design decisions ①two-level scheduling and resource offers ②fair-sharing and revocable resources ③high-availability and fault-tolerance ④execution and isolation ⑤C++
Mesos allocates resources to frameworks using a fair-sharing algorithm we created called Dominant Resource Fairness (DRF)
DRF, born of static partitioning
static partitioning across teams (promotions, trends, recommendations) is trivially fairly shared!
goal: fairly share the resources without static partitioning
partition utilizations:
team            CPU utilization   RAM utilization   bottleneck
promotions      45%               100%              RAM
trends          75%               100%              RAM
recommendations 100%              50%               CPU
observation: a dominant resource bottlenecks each team from running any more jobs/tasks
insight: allocating a fair share of each team’s dominant resource guarantees they can run at least as many jobs/tasks as with static partitioning!
… if my team gets at least 1/N of my dominant resource I will do no worse than if I had my own cluster, but I might do better when resources are available!
DRF in Mesos: ① frameworks specify a role when they register (i.e., the team to charge for the resources) ② the master calculates each role's dominant resource (dynamically) and allocates appropriately
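A minimal sketch of the DRF idea (illustrative, unsaturated numbers; the real allocator lives in the Mesos master): each role's dominant share is its largest fraction of any single resource, and the next offer goes to the role with the smallest dominant share.

```cpp
// Dominant Resource Fairness in miniature: offer resources to the role with
// the smallest dominant share, optionally divided by the role's weight.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct Role {
  std::string name;
  double weight;          // weighted DRF: higher weight => larger fair share
  double cpus, memGB;     // resources currently allocated to this role
};

const double TOTAL_CPUS = 100, TOTAL_MEM = 100;

double dominantShare(const Role& r) {
  double share = std::max(r.cpus / TOTAL_CPUS, r.memGB / TOTAL_MEM);
  return share / r.weight;
}

int main() {
  std::vector<Role> roles = {
    {"promotions",      1.0, 45, 80},   // dominant resource: RAM (0.80)
    {"trends",          1.0, 75, 90},   // dominant resource: RAM (0.90)
    {"recommendations", 1.0, 90, 50},   // dominant resource: CPU (0.90)
  };

  // The next offer goes to whichever role is furthest below its fair share.
  const Role* next = &roles[0];
  for (const Role& r : roles) {
    if (dominantShare(r) < dominantShare(*next)) next = &r;
  }
  std::cout << "offer next resources to: " << next->name << "\n";
}
```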
in practice, fair sharing is insufficient
weighted fair sharing: each team gets a weight
Mesos implements weighted DRF: masters can be configured with weights per role; resource allocation decisions incorporate the weights to determine dominant fair shares
in practice, weighted fair sharing is still insufficient
a non-cooperative framework (e.g., one with long tasks, or a buggy one) can get allocated too many resources
Mesos provides reservations: slaves can be configured with resource reservations for particular roles (dynamic, time-based, and percentage-based reservations are in development); resource offers include the reservation role (if any), e.g., offer { hostname, 4 CPUs, 4 GB RAM, role: trends }
reservations provide guarantees, but at the cost of utilization
revocable resources: reserved resources that are unused can be allocated to frameworks from different roles (e.g., an offer { hostname, 4 CPUs, 4 GB RAM, role: trends } made to a promotions framework), but those resources may be revoked at any time
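A tiny sketch of how an allocator might mark lent-out reserved capacity as revocable (hypothetical types, not the Mesos API):

```cpp
// Reservations with revocable resources: unused reserved capacity is lent to
// other roles, but flagged revocable so it can be reclaimed at any time.
#include <iostream>
#include <string>

struct Offer {
  std::string role;      // role holding the reservation (e.g., "trends")
  double cpus;
  bool revocable;        // true when lent to a framework outside the role
};

Offer makeOffer(const std::string& reservedFor, const std::string& requester,
                double unusedCpus) {
  // Same resources, different guarantees: the reserving role gets them
  // outright; anyone else gets them marked revocable (tasks may be killed).
  return Offer{reservedFor, unusedCpus, requester != reservedFor};
}

int main() {
  Offer a = makeOffer("trends", "trends", 4);
  Offer b = makeOffer("trends", "promotions", 4);
  std::cout << "offer to trends revocable? " << a.revocable << "\n";      // 0
  std::cout << "offer to promotions revocable? " << b.revocable << "\n";  // 1
}
```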
preemption via revocation: “my tasks will not be killed unless I'm using revocable resources!”
design decisions ①two-level scheduling and resource offers ②fair-sharing and revocable resources ③high-availability and fault-tolerance ④execution and isolation ⑤C++
high-availability and fault-tolerance: ① framework failover ② master failover ③ slave failover (machine failure, process failure (bugs!), upgrades)
① framework failover: the framework re-registers with the master and resumes operation; all tasks keep running across framework failover!
② master failover: after a new master is elected, all frameworks and slaves connect to the new master; all tasks keep running across master failover!
③ slave failover: the mesos-slave process can fail or be restarted (e.g., for an upgrade) without killing its executors and tasks; the restarted mesos-slave recovers from checkpointed state and reconnects to the still-running tasks (important for large in-memory services, which are expensive to restart)
design decisions ①two-level scheduling and resource offers ②fair-sharing and revocable resources ③high-availability and fault-tolerance ④execution and isolation ⑤C++
execution: frameworks launch fine-grained tasks (e.g., task { 3 CPUs, 2 GB RAM }); if necessary, a framework can provide an executor to handle the execution of its tasks
execution on the slave: the mesos-slave launches the framework's executor, and the executor runs the task
goal: isolation
isolation: on each slave, the mesos-slave runs executors and their tasks inside containers
executor + task design means containers can have changing resource allocations
as tasks come and go, each container's resource allocation is grown and shrunk accordingly
making the task first-class gives us true fine-grained resource sharing
requirement: fast task launching (i.e., milliseconds or less)
virtual machines: an anti-pattern
operating-system virtualization: containers (zones and projects), control groups (cgroups), namespaces
isolation: tight integration with cgroups: CPU (upper and lower bounds), memory, network I/O (traffic controller, in development), filesystem (using LVM, in development)
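For flavor, here is roughly what cgroups integration looks like at the filesystem level: control files are written like ordinary files (cgroup v1 layout; the mount point, container name, and pid are assumptions, and the cgroup directories must already have been created):

```cpp
// Sketch of cgroup-based isolation (cgroup v1, as in Mesos of this era).
#include <fstream>
#include <string>

void writeControl(const std::string& path, const std::string& value) {
  std::ofstream file(path);   // cgroup control files are written like files
  file << value;
}

int main() {
  const std::string cg = "mesos_executor_1";  // hypothetical container name

  // CPU lower bound via shares; upper bound via CFS quota
  // (400000us per default 100000us period = at most 4 CPUs).
  writeControl("/sys/fs/cgroup/cpu/" + cg + "/cpu.shares", "1024");
  writeControl("/sys/fs/cgroup/cpu/" + cg + "/cpu.cfs_quota_us", "400000");

  // Memory upper bound: 4 GB.
  writeControl("/sys/fs/cgroup/memory/" + cg + "/memory.limit_in_bytes",
               "4294967296");

  // Attach a process (pid 1234, hypothetical) to the container.
  writeControl("/sys/fs/cgroup/cpu/" + cg + "/tasks", "1234");
}
```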
statistics: rarely does allocation == usage (humans are bad at estimating the amount of resources they're using); statistics are collected for capacity planning (and, in development, oversubscription)
CPU upper bounds? in practice, determinism trumps utilization
design decisions ①two-level scheduling and resource offers ②fair-sharing and revocable resources ③high-availability and fault-tolerance ④execution and isolation ⑤C++
requirements: ①performance ②maintainability (static typing) ③interfaces to low-level OS (for isolation, etc) ④interoperability with other languages (for library bindings)
garbage collection: a performance anti-pattern
consequences: ①antiquated libraries (especially around concurrency and networking) ②nascent community
github.com/3rdparty/libprocess: concurrency via futures/actors, networking via message passing
github.com/3rdparty/stout: monads in C++, safe and understandable utilities
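For flavor, the futures style libprocess enables, illustrated here with the C++ standard library (this is not the libprocess API; see the repo above for the real one):

```cpp
// Futures-based concurrency, sketched with std::async/std::future.
#include <future>
#include <iostream>

int computeAnswer() { return 42; }  // stand-in for an asynchronous operation

int main() {
  // Kick off work asynchronously and get a future for the eventual result.
  std::future<int> answer = std::async(std::launch::async, computeAnswer);

  // The caller stays unblocked until it actually needs the value; libprocess
  // goes further, with continuations and actor-style message passing.
  std::cout << "answer = " << answer.get() << "\n";
}
```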
but …
scalability: simulations to 50,000+ slaves
@twitter we run multiple Mesos clusters, each with … nodes
design decisions ①two-level scheduling and resource offers ②fair-sharing and revocable resources ③high-availability and fault-tolerance ④execution and isolation ⑤C++
final remarks
frameworks Hadoop (github.com/mesos/hadoop) Spark (github.com/mesos/spark) DPark (github.com/douban/dpark) Storm (github.com/nathanmarz/storm) Chronos (github.com/airbnb/chronos) MPICH2 (in mesos git repository) Marathon (github.com/mesosphere/marathon) Aurora (github.com/twitter/aurora)
write your next distributed system with Mesos!
port a framework to Mesos: write a “wrapper”; ~100 lines of code (the more lines, the more you can take advantage of elasticity and other Mesos features); see github.com/mesos/hadoop, and the sketch below
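A condensed sketch of what such a wrapper's scheduler looks like against the classic Mesos C++ API; the callback signatures here are reconstructed from memory, so verify them against mesos/scheduler.hpp for your release before relying on them.

```cpp
// Minimal Mesos framework "wrapper": accept offers, launch one command task.
#include <mesos/scheduler.hpp>
#include <iostream>
#include <string>
#include <vector>

using namespace mesos;

class WrapperScheduler : public Scheduler {
public:
  // The heart of a wrapper: turn resource offers into tasks.
  void resourceOffers(SchedulerDriver* driver,
                      const std::vector<Offer>& offers) override {
    for (const Offer& offer : offers) {
      TaskInfo task;
      task.set_name("wrapped-task");
      task.mutable_task_id()->set_value("task-1");
      task.mutable_slave_id()->MergeFrom(offer.slave_id());
      task.mutable_command()->set_value("echo hello from mesos");
      task.mutable_resources()->MergeFrom(offer.resources());

      driver->launchTasks(offer.id(), {task});
    }
  }

  void statusUpdate(SchedulerDriver*, const TaskStatus& status) override {
    std::cout << "task " << status.task_id().value()
              << " is now in state " << status.state() << std::endl;
  }

  // Remaining callbacks stubbed out for brevity.
  void registered(SchedulerDriver*, const FrameworkID&, const MasterInfo&) override {}
  void reregistered(SchedulerDriver*, const MasterInfo&) override {}
  void disconnected(SchedulerDriver*) override {}
  void offerRescinded(SchedulerDriver*, const OfferID&) override {}
  void frameworkMessage(SchedulerDriver*, const ExecutorID&, const SlaveID&,
                        const std::string&) override {}
  void slaveLost(SchedulerDriver*, const SlaveID&) override {}
  void executorLost(SchedulerDriver*, const ExecutorID&, const SlaveID&, int) override {}
  void error(SchedulerDriver*, const std::string&) override {}
};

int main() {
  FrameworkInfo framework;
  framework.set_user("");            // let Mesos fill in the current user
  framework.set_name("my-wrapper");  // hypothetical framework name

  WrapperScheduler scheduler;
  MesosSchedulerDriver driver(&scheduler, framework, "master.host:5050");
  return driver.run() == DRIVER_STOPPED ? 0 : 1;
}
```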
Thank You! mesos.apache.org
stateless master: to make master failover fast, we chose to make the master stateless; state is stored in the leaves, at the frameworks and the slaves. this makes sense for frameworks that don't want to store state (i.e., can't actually failover). consequences: slaves are fairly complicated (they need to checkpoint), and frameworks need to save their own state and reconcile (we built some tools to help, including a replicated log)
Apache Mesos is a distributed system for running and building other distributed systems
origins: a Berkeley research project including Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica (see mesos.apache.org/documentation)
ecosystem: mesos developers, operators, framework developers
a tour of mesos from different perspectives of the ecosystem
the operator
people who run and manage frameworks (Hadoop, Storm, MPI, Spark, Memcache, etc.); tools: virtual machines, Chef, Puppet (emerging: PaaS, Docker); “ops” at most companies (SREs at Twitter); the static partitioners
for the operator, Mesos is a cluster manager
for the operator, Mesos is a resource manager
for the operator, Mesos is a resource negotiator
for the operator, Mesos replaces static partitioning of resources to frameworks with dynamic resource allocation
for the operator, Mesos is a distributed system with a master/slave architecture
frameworks/applications register with the Mesos master in order to run jobs/tasks
frameworks can be required to authenticate as a principal*: masters are initialized with secrets and use SASL with the CRAM-MD5 mechanism (Kerberos in development)
Mesos is highly-available and fault-tolerant
the framework developer
…
Mesos uses Apache ZooKeeper for coordination
increase utilization with revocable resources and preemption: an unreserved offer { hostname, 4 CPUs, 4 GB RAM, role: - } can be made to framework1, framework2, and framework3
authorization*: principals can be used for authorizing allocation roles and authorizing operating-system users (for execution)
agenda: motivation and overview; resource allocation; frameworks, schedulers, tasks, status updates; high-availability; resource isolation and statistics; security; case studies
I’d love to answer some questions with the help of my data!
I think I’ll try Hadoop.
your datacenter
+ Hadoop
happy?
Not exactly …
… Hadoop is a big hammer, but not everything is a nail!
I’ve got some iterative algorithms, I want to try Spark!
datacenter management
static partitioning
static partitioning considered harmful: (1) hard to share data (2) hard to scale elastically (to exploit statistical multiplexing) (3) hard to fully utilize machines (4) hard to deal with failures
Hadoop = map/reduce + a distributed file system (HDFS)
HDFS
Could we just give Spark its own HDFS cluster too?
HDFS x 2
HDFS x 2: tee incoming data (2 copies), plus a periodic copy/sync between the clusters
That sounds annoying … let’s not do that. Can we do any better though?
HDFS
static partitioning considered harmful: (1) hard to share data (2) hard to scale elastically (to exploit statistical multiplexing) (3) hard to fully utilize machines (4) hard to deal with failures
During the day I’d rather give more machines to Spark but at night I’d rather give more machines to Hadoop!
datacenter management
static partitioning considered harmful: (1) hard to share data (2) hard to scale elastically (to exploit statistical multiplexing) (3) hard to fully utilize machines (4) hard to deal with failures
datacenter management
static partitioning considered harmful: (1) hard to share data (2) hard to scale elastically (to exploit statistical multiplexing) (3) hard to fully utilize machines (4) hard to deal with failures
datacenter management
I don’t want to deal with this!
the datacenter: rather than think about it as racks of individual machines …
… think of the datacenter as a computer
the datacenter computer: applications, resources, a filesystem
mesos is the kernel of the datacenter computer: frameworks are its applications, running atop resources and a filesystem
Step 1: filesystem
Step 2: mesos: run a “master” (or multiple for high availability), and run “slaves” on the rest of the machines
Step 3: frameworks
Step 4: profit
Step 4: profit (statistical multiplexing): reduces CapEx and OpEx, and reduces latency!
Step 4: profit (utilization)
Step 4: profit (failures)
agenda: motivation and overview; resource allocation; frameworks, schedulers, tasks, status updates; high-availability; resource isolation and statistics; security; case studies
resource allocation
reservations: resources can be reserved per slave to provide guarantees; this requires human participation (ops) to determine which roles should have which resources reserved; kind of like thread affinity, but across many machines (and not just for CPUs)
(1) allocate reserved resources to frameworks authorized for a particular role (2) allocate unused reserved resources and unused unreserved resources fairly amongst all frameworks according to their weights
preemption: if a framework runs tasks outside of its reservations, they can be preempted (i.e., the task killed and the resources revoked) in favor of a framework running a task within its reservation
agenda: motivation and overview; resource allocation; frameworks, schedulers, tasks, status updates; high-availability; resource isolation and statistics; security; case studies
framework ≈ distributed system
framework commonality: run processes/tasks simultaneously (distributed), handle process failures (fault-tolerant), optimize performance (elastic); in short: coordinate execution
frameworks are execution coordinators
frameworks are execution schedulers
the end-to-end principle: “application-specific functions ought to reside in the end hosts of a network rather than intermediary nodes”; i.e., frameworks want to coordinate their tasks' execution, and they should be able to
framework anatomy: frameworks talk to mesos through the scheduling API
scheduling
i’d like to run some tasks!
scheduling: “here are some resource offers!”
resource offers: an offer represents a snapshot of the available resources on a particular machine (e.g., foo.bar.com: 4 CPUs, 4 GB RAM) that a framework can use to run tasks; schedulers pick which resources to use to run their tasks
“two-level scheduling”: mesos controls resource allocations to schedulers; schedulers make decisions about what to run given allocated resources
concurrency control: the same resources may be offered to different frameworks; the design space runs from pessimistic (no overlapping offers) to optimistic (all overlapping offers)
tasks: the “threads” of the framework and a consumer of resources (CPU, memory, etc.); either a concrete command line or an opaque description (which requires an executor)
tasks: “here are some resources!” … “launch these tasks!”
status updates: “task status update!”
more scheduling
i’d like to run some tasks!
agenda: motivation and overview; resource allocation; frameworks, schedulers, tasks, status updates; high-availability; resource isolation and statistics; security; case studies
high-availability
high-availability (master): “i'd like to run some tasks!” and “task status update!” both keep working across master failover
high-availability (framework)
high-availability (slave)
agenda: motivation and overview; resource allocation; frameworks, schedulers, tasks, status updates; high-availability; resource isolation and statistics; security; case studies
resource isolation: leverage Linux control groups (cgroups): CPU (upper and lower bounds), memory, network I/O (traffic controller, in progress), filesystem (LVM, in progress)
resource statistics: rarely does allocation == usage (humans are bad at estimating the amount of resources they're using); per-task/executor statistics are collected (for all fork/exec'ed processes too!) and can help with capacity planning
agenda: motivation and overview; resource allocation; frameworks, schedulers, tasks, status updates; high-availability; resource isolation and statistics; security; case studies
security: Twitter recently added SASL support; the default mechanism is CRAM-MD5, with Kerberos support coming in the short term
agenda: motivation and overview; resource allocation; frameworks, schedulers, tasks, status updates; high-availability; resource isolation and statistics; security; case studies
framework commonality: run processes/tasks simultaneously (distributed), handle process failures (fault-tolerant), optimize performance (elastic)
as a “kernel”, mesos provides primitives that make writing a new framework easier (launching tasks, failure detection, etc.); why re-implement them each time!?
case study: Chronos, a distributed cron with dependencies developed at airbnb; ~3k lines of Scala! distributed, highly available, and fault-tolerant without any network programming!
analytics
analytics + services
case study: Aurora, “run 200 of these, somewhere, forever”; developed at Twitter; highly available (uses the Mesos replicated log); uses a Python DSL to describe services; leverages service discovery and proxying (see Twitter commons)
port a framework to mesos: write a “wrapper” scheduler; ~100 lines of code (the more lines, the more you can take advantage of elasticity and other mesos features); see github.com/mesos/hadoop
conclusions: datacenter management is a pain
conclusions: mesos makes running frameworks on your datacenter easier, as well as increasing utilization and performance while reducing CapEx and OpEx!
conclusions: rather than build your next distributed system from scratch, consider using mesos
conclusions: you can share your datacenter between analytics and online services!
Questions?
aurora
framework commonality: run processes simultaneously (distributed), handle process failures (fault-tolerance), optimize execution (elasticity, scheduling)
primitives: scheduler: the distributed system's “master” or “coordinator”; executor: lower-level control of task execution (optional); requests/offers: resource allocations; tasks: the “threads” of the distributed system; …
scheduler: e.g., Apache Hadoop, Chronos
the scheduler (1) brokers for resources, (2) launches tasks, and (3) handles task termination
brokering for resources: (1) make resource requests, e.g., request { 2 CPUs, 1 GB RAM, slave: * } (2) respond to resource offers, e.g., offer { 4 CPUs, 4 GB RAM, slave: foo.bar.com }
offers: non-blocking resource allocation; they exist to answer the question “what should mesos do if it can't satisfy a request?” (1) wait until it can (2) offer the best allocation it can immediately
resource allocation: schedulers (e.g., Apache Hadoop, Chronos) send requests to the allocator, which applies dominant resource fairness and resource reservations and answers with offers; concurrency control spans pessimistic (no overlapping offers) to optimistic (all overlapping offers)
“two-level scheduling”: mesos controls resource allocations to framework schedulers; schedulers make decisions about what to run given allocated resources
the end-to-end principle: “application-specific functions ought to reside in the end hosts of a network rather than intermediary nodes”
tasks: either a concrete command line or an opaque description (which requires a framework executor to execute); a consumer of resources
task operations: launching/killing; health monitoring/reporting (failure detection); resource usage monitoring (statistics)
resource isolation: a cgroup per executor, or per task (if no executor); resource controls are adjusted dynamically as tasks come and go!
case study: chronos, a distributed cron with dependencies built at airbnb
before chronos
a single point of failure (and AWS was unreliable); resource starved (not scalable)
chronos requirements: fault tolerance; distributed (elastically take advantage of resources); retries (make sure a command eventually finishes); dependencies
chronos leverages the primitives of mesos: ~3k lines of scala; highly available (uses Mesos state); distributed/elastic; no actual network programming!
after chronos
after chronos + hadoop
case study: aurora, “run 200 of these, somewhere, forever”, built at Twitter
before aurora: static partitioning of machines to services; hardware outages caused site outages; puppet + monit; ops couldn't scale as fast as engineers
aurora: highly available (uses the mesos replicated log); uses a python DSL to describe services; leverages service discovery and proxying (see Twitter commons)
after aurora: power loss to 19 racks, no lost services! more than 400 engineers running services; the largest cluster has >2500 machines
[diagram: Mesos nodes running Hadoop, Spark, MPI, Storm, and Chronos side by side]