A Platform for Fine-Grained Resource Sharing in the Data Center


1 A Platform for Fine-Grained Resource Sharing in the Data Center
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
Benjamin Hindman, Andrew Konwinski, Matei Zaharia, et al.
Presented by Chen He. Adapted from the authors' website.

2 At the Beginning
Mesos is not the only solution that provides a platform for all applications to share a cluster.
A paper analysis and some research ideas are provided at the end; some background information is provided in the Glossary.

3 Background
Rapid innovation in cluster computing frameworks: Pregel, Dryad, Pig, Percolator, and more.
There are things MapReduce and Dryad cannot do, there are competing implementations, and resources also need to be shared with other data center services. The excitement surrounding cluster computing frameworks like Hadoop continues to grow (e.g., Hadoop on EC2 and Dryad in Azure). Startups, enterprises, and researchers are bursting with ideas to improve these existing frameworks. More importantly, as we encounter the limitations of MapReduce, we are making a shopping list of what we want in next-generation frameworks: new abstractions, new programming models, even new implementations of existing models (e.g., Disco, an Erlang MapReduce). We believe that no single framework can best facilitate this innovation; instead, people will want to run existing and new frameworks on the same physical clusters at the same time.

4 Problem
Rapid innovation in cluster computing frameworks; this paper provides a platform for the different frameworks.
We want to run multiple frameworks in a single cluster:
…to maximize utilization
…to share data between frameworks

5 Where We Want to Go
Today: static partitioning. Mesos: dynamic sharing.
[Diagram: today, Hadoop, Pregel, and MPI each run on their own statically partitioned set of nodes; with Mesos they run on one shared cluster.]
Frameworks are developed independently, and with multiple jobs their demand varies over time; static partitioning is what current schedulers do. With better utilization, the number of nodes can be decreased.

6 Solution
Mesos is a common resource sharing layer over which diverse frameworks can run.
[Before/after diagram: Hadoop and Pregel each on their own dedicated nodes, versus Hadoop and Pregel running on top of Mesos across all nodes.]

7 Other Benefits of Mesos
Run multiple instances of the same framework
Isolate production and experimental jobs
Run multiple versions of the framework concurrently
Fully utilize the cluster
Simplify cluster management?

8 Outline
Mesos Goals and Architecture
Implementation
Results
Related Work

9 Mesos Goals
High utilization of resources
Support diverse frameworks
Scalability to 10,000's of nodes
Reliability in the face of failures
Assumptions: shared storage; not all tasks require a full node; not all nodes are identical.
We have achieved these goals, and Mesos has been deployed.
Resulting design: a small, microkernel-like core that pushes scheduling logic to frameworks.

10 Design Elements
Fine-grained sharing: allocation at the level of tasks within a job; improves utilization, latency, and data locality.
Resource offers: a simple, scalable, application-controlled scheduling mechanism.

11 Element 1: Fine-Grained Sharing
Coarse-grained sharing (HPC) versus fine-grained sharing (Mesos).
[Diagram: under coarse-grained sharing, each framework holds whole nodes above the storage system (e.g. HDFS); under fine-grained sharing, short tasks from frameworks 1-3 are interleaved across the same nodes above the same storage system.]
+ Improved utilization, responsiveness, data locality

12 Element 2: Resource Offers
Option: a global scheduler. Frameworks express their needs in a specification language, and the global scheduler matches them to resources.
+ Can make optimal decisions
– Complex: the language must support all framework needs
– Difficult to be robust

13 Element 2: Resource Offers
Mesos: resource offers. Offer available resources to frameworks and let them pick which resources to use and which tasks to launch.
+ Keeps Mesos simple and makes it possible to support future frameworks
– Decentralized decisions might not be optimal
We still need a language for describing resources, but that is a fundamentally easier problem.

14 Mesos Architecture
[Architecture diagram: an MPI job and a Hadoop job each bring their own scheduler; the Mesos master's allocation module picks which framework to offer resources to and sends it a resource offer; Mesos slaves run the MPI executors, which run the tasks.]

15 Mesos Architecture
Resource offer = list of (node, availableResources)
E.g. { (node1, <2 CPUs, 4 GB>), (node2, <3 CPUs, 2 GB>) }
[Same architecture diagram, highlighting the resource offer that the master's allocation module sends to the framework it has picked.]
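To make the offer format concrete, here is a minimal C++ sketch of a resource offer as a list of (node, availableResources) pairs and of a framework deciding how many of its tasks fit on each node. The struct names and the per-task requirement are illustrative assumptions, not the real Mesos types.

```cpp
// Minimal sketch (not the real Mesos API): an offer is a list of
// (node, availableResources) pairs; the framework picks what it can use.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct Resources {
  int cpus;
  int memGB;
};

struct Offer {
  std::string node;
  Resources available;
};

int main() {
  // E.g. { (node1, <2 CPUs, 4 GB>), (node2, <3 CPUs, 2 GB>) }
  std::vector<Offer> offer = {{"node1", {2, 4}}, {"node2", {3, 2}}};

  // A framework that needs 1 CPU and 2 GB per task accepts what it can
  // use on each node and implicitly declines the rest.
  const Resources perTask{1, 2};
  for (const Offer& o : offer) {
    int fit = std::min(o.available.cpus / perTask.cpus,
                       o.available.memGB / perTask.memGB);
    std::cout << "launch " << fit << " task(s) on " << o.node << "\n";
  }
  return 0;
}
```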

16 Mesos Architecture
[Same architecture diagram, annotated: each framework's scheduler performs framework-specific scheduling of its own tasks, while the Mesos slaves launch and isolate the MPI and Hadoop executors that run them.]
Note the separation of inter-framework scheduling (the master's allocation module picks which framework to offer resources to) and intra-framework scheduling (each framework's scheduler picks which tasks to launch).
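A rough sketch of the two scheduling levels described here, with all structure assumed for illustration (it is not Mesos source): the allocation module decides which framework is offered the free resources, and that framework's scheduler decides which of its tasks to launch on them.

```cpp
// Inter-framework scheduling (allocation module) vs intra-framework
// scheduling (framework scheduler) -- illustrative structure only.
#include <iostream>
#include <string>
#include <vector>

struct Offer { std::string node; int cpus; };
struct Task  { std::string name, node; };

struct FrameworkScheduler {
  std::string name;
  int pendingTasks;
  // Intra-framework decision: consume the offer however this framework likes.
  std::vector<Task> onOffer(const std::vector<Offer>& offers) {
    std::vector<Task> launched;
    for (const Offer& o : offers)
      for (int c = 0; c < o.cpus && pendingTasks > 0; ++c, --pendingTasks)
        launched.push_back({name + "-task-" + std::to_string(pendingTasks), o.node});
    return launched;
  }
};

int main() {
  std::vector<FrameworkScheduler> frameworks = {{"mpi", 1}, {"hadoop", 3}};
  std::vector<Offer> freeResources = {{"node1", 2}, {"node2", 1}};

  // Inter-framework decision (allocation module): pick which framework is
  // offered the free resources -- here, simply the one with the most
  // pending tasks.
  FrameworkScheduler* chosen = &frameworks[0];
  for (FrameworkScheduler& fw : frameworks)
    if (fw.pendingTasks > chosen->pendingTasks) chosen = &fw;

  // Intra-framework decision: the chosen framework's scheduler decides
  // which tasks to launch on the offered nodes.
  for (const Task& t : chosen->onOffer(freeResources))
    std::cout << t.name << " on " << t.node << "\n";
}
```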

17 Optimization: Filters
Let frameworks short-circuit rejection by providing a predicate on the resources to be offered, e.g. "nodes from list L" or "nodes with > 8 GB RAM".
This could generalize to other hints as well.
The ability to reject offers still ensures correctness when needs cannot be expressed using filters.
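A small hypothetical sketch of the filter idea: the framework registers a predicate over nodes (the "nodes with > 8 GB RAM" example from the slide), the master skips resources that fail it, and the framework can still reject anything it is offered. The types and field names are invented for the example.

```cpp
// Filters as an optimization: the master avoids offering resources the
// framework has already said it does not want.
#include <functional>
#include <iostream>
#include <string>
#include <vector>

struct Node {
  std::string name;
  int ramGB;
};

using Filter = std::function<bool(const Node&)>;

int main() {
  std::vector<Node> nodes = {{"a", 4}, {"b", 16}, {"c", 32}};

  // "nodes with > 8 GB RAM" -- the filter from the slide.
  Filter wantsBigRam = [](const Node& n) { return n.ramGB > 8; };

  // The master only offers nodes that pass the framework's filter; the
  // framework can still reject anything it is offered, so filters are
  // purely an optimization, not a correctness requirement.
  for (const Node& n : nodes) {
    if (wantsBigRam(n)) {
      std::cout << "offer " << n.name << " to framework\n";
    }
  }
  return 0;
}
```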

18 Framework Isolation
Mesos uses OS isolation mechanisms such as Linux containers and Solaris projects.
Containers currently support CPU, memory, I/O, and network bandwidth isolation.
Not perfect, but much better than no isolation.
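For a sense of what OS-level isolation looks like underneath Linux containers, here is a hedged sketch that uses the cgroup filesystem to cap an executor's memory and CPU weight. It assumes a legacy cgroup v1 hierarchy mounted at /sys/fs/cgroup (cgroup v2 uses different file names), needs root, and is not how Mesos itself configures isolation.

```cpp
// Illustrative only: one low-level mechanism Linux containers build on is
// cgroups, where each resource knob is just a file under /sys/fs/cgroup.
#include <fstream>
#include <string>
#include <sys/stat.h>
#include <unistd.h>

static void writeFile(const std::string& path, const std::string& value) {
  std::ofstream f(path);
  f << value;
}

int main() {
  const std::string cg = "executor_demo";  // hypothetical cgroup name

  // Cap the executor's memory at 1 GB.
  mkdir(("/sys/fs/cgroup/memory/" + cg).c_str(), 0755);
  writeFile("/sys/fs/cgroup/memory/" + cg + "/memory.limit_in_bytes",
            "1073741824");

  // Give it a relative CPU weight (the default weight is 1024).
  mkdir(("/sys/fs/cgroup/cpu/" + cg).c_str(), 0755);
  writeFile("/sys/fs/cgroup/cpu/" + cg + "/cpu.shares", "512");

  // Move this process (and its future children) into both cgroups.
  const std::string pid = std::to_string(getpid());
  writeFile("/sys/fs/cgroup/memory/" + cg + "/cgroup.procs", pid);
  writeFile("/sys/fs/cgroup/cpu/" + cg + "/cgroup.procs", pid);
  return 0;
}
```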

19 Analysis Resource offers work well when:
Frameworks can scale up and down elastically
Task durations are homogeneous
Frameworks have many preferred nodes
These conditions hold in current data analytics frameworks (MapReduce, Dryad, …): work is divided into short tasks to facilitate load balancing and fault recovery, and data is replicated across multiple nodes.

20 Revocation
Mesos allocation modules can revoke (kill) tasks to meet organizational policies.
The framework is given a grace period to clean up.
A "guaranteed share" API lets frameworks avoid revocation by staying below a certain share.
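A sketch of one possible revocation policy consistent with the guaranteed-share idea above: only frameworks running above their guaranteed share are candidates for revocation, and only as much as needed is reclaimed. The numbers and struct layout are assumptions for illustration, not the paper's allocation module.

```cpp
// Revoke resources only from frameworks above their guaranteed share,
// after a grace period (not modeled here).
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct Framework {
  std::string name;
  double guaranteedShare;  // fraction of the cluster promised to it
  double currentShare;     // fraction it currently uses
};

int main() {
  std::vector<Framework> fws = {
      {"hadoop", 0.25, 0.60}, {"mpi", 0.25, 0.20}, {"spark", 0.25, 0.20}};

  const double needed = 0.10;  // share we must free for a pending framework
  double freed = 0.0;
  for (Framework& f : fws) {
    if (freed >= needed) break;
    // A framework at or below its guaranteed share is never revoked.
    double surplus = f.currentShare - f.guaranteedShare;
    if (surplus <= 0) continue;
    double take = std::min(surplus, needed - freed);
    std::cout << "revoke " << take << " of the cluster from " << f.name
              << " (grace period applies)\n";
    freed += take;
  }
  return 0;
}
```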

21 Implementation Stats
20,000 lines of C++
Master failover using ZooKeeper
Frameworks ported: Hadoop, MPI, Torque
New specialized framework: Spark, for iterative jobs
Open source in the Apache Incubator

22 Users
Twitter uses Mesos on > 100 nodes to run ~12 production services (mostly stream processing).
Berkeley machine learning researchers are running several algorithms at scale on Spark.
UCSF medical researchers are using Mesos to run Hadoop and, eventually, non-Hadoop apps.

23 Results
Utilization and performance vs. static partitioning
Framework placement goals: data locality
Scalability
Fault recovery

24 Dynamic Resource Sharing
This graph shows the number of CPUs allocated to each of the four frameworks in the macrobenchmark over the duration of the 25-minute experiment. Some frameworks take advantage of the cluster when other frameworks have a lull in activity. For instance, around time 1050 the Torque framework finishes running its MPI job and no more MPI jobs are queued, so it releases its share (until another MPI job is queued around time 1250 seconds). The two Hadoop workloads take advantage of the resources Torque is not using, getting more work done and increasing the average fraction of the cluster that is allocated at any time. The four frameworks are: (1) Spark, a specialized Scala framework running iterative machine learning MapReduce jobs; (2) Facebook Hadoop Mix, Hive on top of Hadoop running a workload mix modeled on real Facebook traces; (3) Large Hadoop Mix, large I/O-intensive text search jobs representing a large, greedy workload; and (4) Torque/MPI, running a number of different-sized instances of the tachyon ray tracing MPI application, part of the SPEC MPI2007 benchmark.

25 Mesos vs Static Partitioning
Compared performance with a statically partitioned cluster where each framework gets 25% of the nodes.
Framework             Speedup on Mesos
Facebook Hadoop Mix   1.14×
Large Hadoop Mix      2.10×
Spark                 1.26×
Torque / MPI          0.96×
We also ran each of the four frameworks with its corresponding workload in a statically allocated, dedicated cluster of 25 nodes. This represents the state of many organizations today, where each of the four frameworks gets its own smaller (in this case one-quarter the size) dedicated cluster. For each framework, we show the number of CPUs allocated to it over time in the Mesos experiment and in the dedicated-cluster experiment. Most of the frameworks finish their jobs sooner on Mesos, since they can occasionally get more than their fair share of the CPUs. Torque, on the other hand, does not accept more resources once it has reached its safe allocation (discussed in the paper).

26 Data Locality with Resource Offers
Ran 16 instances of Hadoop on a shared HDFS cluster Used delay scheduling [EuroSys ’10] in Hadoop to get locality (wait a short time to acquire data-local nodes) 1.7× 93 EC2 instances, 4 cores, 15 GB ram, map-only job, scans 100 GB on HDFS, prints 1%, Setup: Mesos doesn’t know about locality, can we still get it?

27 Scalability
Mesos only performs inter-framework scheduling (e.g. fair sharing), which is easier than intra-framework scheduling.
Result: scaled to 50,000 emulated slaves (generating traffic equivalent to real slaves), 200 frameworks, and 100K tasks (30 s task length).
The slaves are emulated, and the 200 frameworks continuously generate tasks to run on them; each task takes 30 seconds and each slave runs up to two tasks at a time. Once the cluster reaches a steady state (the 200 frameworks have achieved their fair shares and all resources are allocated), they launch a test framework that runs a single 10-second task and measure how long the framework takes end to end. The overhead includes registering with the master, waiting for a resource offer and accepting it, waiting for the master to process the response, launching a task on a slave, and waiting for Mesos to report that the task has finished.
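As a hedged illustration of what inter-framework "fair sharing" could mean, here is max-min fair allocation of a single resource (CPUs) across frameworks with different demands. Mesos allocation modules are pluggable, so this is a generic example rather than the module used in the experiments.

```cpp
// Max-min fair sharing of CPUs: repeatedly split the remaining capacity
// evenly among the frameworks that still want more.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct Framework {
  std::string name;
  double demand;      // CPUs it could use right now
  double allocated = 0.0;
};

void maxMinFair(std::vector<Framework>& fws, double totalCpus) {
  // Processing frameworks from smallest to largest demand implements
  // water-filling: small demands are fully satisfied, large ones split
  // whatever remains.
  std::sort(fws.begin(), fws.end(),
            [](const Framework& a, const Framework& b) { return a.demand < b.demand; });
  double remaining = totalCpus;
  for (size_t i = 0; i < fws.size(); ++i) {
    double fairShare = remaining / (fws.size() - i);
    fws[i].allocated = std::min(fws[i].demand, fairShare);
    remaining -= fws[i].allocated;
  }
}

int main() {
  std::vector<Framework> fws = {{"spark", 30}, {"hadoop", 100}, {"mpi", 10}};
  maxMinFair(fws, 100);  // a 100-CPU cluster
  for (const Framework& f : fws)
    std::cout << f.name << " gets " << f.allocated << " CPUs\n";
}
```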

28 Fault Tolerance
Mesos master failure: the master has only soft state, the list of currently running frameworks and tasks, which is rebuilt when frameworks and slaves re-register with the new master after a failure.
Framework node failure
Scheduler failure
Result: fault detection and recovery in ~10 seconds. Is that good or bad? How does it scale?
ZooKeeper keeps several (four) standby masters, of which only one is the leader. If the leader fails, ZooKeeper elects a new leader and announces it to the slaves and schedulers. Because the state information is stored in the slaves and schedulers, they connect to the new master and help it restore its state. The MTTR is obtained by comparing the time the new master starts with the time the old master died.
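A sketch of the soft-state idea, with all message and class shapes assumed for illustration: a newly elected master starts empty and rebuilds its view of frameworks and running tasks purely from re-registration messages.

```cpp
// Soft state: the master's view of the cluster is reconstructed from
// re-registrations rather than recovered from persistent storage.
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct SlaveReregister {
  std::string slaveId;
  std::vector<std::string> runningTaskIds;  // tasks it is still executing
};

class Master {
 public:
  void onSlaveReregister(const SlaveReregister& msg) {
    tasksBySlave_[msg.slaveId] = msg.runningTaskIds;
  }
  void onFrameworkReregister(const std::string& frameworkId) {
    frameworks_.push_back(frameworkId);
  }
  void dump() const {
    std::cout << frameworks_.size() << " frameworks, "
              << tasksBySlave_.size() << " slaves re-registered\n";
  }
 private:
  std::vector<std::string> frameworks_;
  std::map<std::string, std::vector<std::string>> tasksBySlave_;
};

int main() {
  Master newLeader;  // elected via ZooKeeper, starts with no state
  newLeader.onFrameworkReregister("hadoop");
  newLeader.onSlaveReregister({"slave1", {"t1", "t2"}});
  newLeader.onSlaveReregister({"slave2", {"t3"}});
  newLeader.dump();
}
```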

29 Related Work
HPC schedulers (e.g. Torque, LSF, Sun Grid Engine): coarse-grained sharing for inelastic jobs (e.g. MPI).
Virtual machine clouds: coarse-grained sharing, similar to HPC.
Condor: a centralized scheduler based on matchmaking.
Parallel work, Next-Generation Hadoop: a redesign of Hadoop with per-application masters; also aims to support non-MapReduce jobs; based on a resource request language with locality preferences.

30 Conclusion
Mesos shares clusters efficiently among diverse frameworks thanks to two design elements:
Fine-grained sharing at the level of tasks
Resource offers, a scalable mechanism for application-controlled scheduling
It enables co-existence of current frameworks and development of new specialized ones.
In use at Twitter, UC Berkeley, and UCSF.

31 Critical Thinking of Mesos
Should they have compared performance against a global scheduler instead of the static partitioning mechanism?
Does Mesos work well for a scientific data center like HCC?
The filtering mechanism may cause problems.

32 Questions

33 Mesos API
Scheduler callbacks: resourceOffer(offerId, offers), offerRescinded(offerId), statusUpdate(taskId, status), slaveLost(slaveId)
Scheduler actions: replyToOffer(offerId, tasks), setNeedsOffers(bool), setFilters(filters), getGuaranteedShare(), killTask(taskId)
Executor callbacks: launchTask(taskDescriptor), killTask(taskId)
Executor actions: sendStatus(taskId, status)
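A skeleton framework scheduler written against the callback names listed on this slide (resourceOffer, statusUpdate, slaveLost). The C++ interfaces below are stand-ins defined here for illustration; they are not the real libmesos headers, whose signatures differ, and replyToOffer is modeled simply as the return value of resourceOffer.

```cpp
// Skeleton of a framework scheduler using this slide's callback names.
#include <iostream>
#include <string>
#include <vector>

struct Offer { std::string node; int cpus, memGB; };
struct Task  { std::string taskId, node; };

class Scheduler {
 public:
  // Callback invoked when resources become available. In the slide's API
  // the scheduler would answer via the replyToOffer(offerId, tasks) action;
  // here it simply returns the tasks to launch.
  virtual std::vector<Task> resourceOffer(const std::string& offerId,
                                          const std::vector<Offer>& offers) = 0;
  virtual void statusUpdate(const std::string& taskId,
                            const std::string& status) = 0;
  virtual void slaveLost(const std::string& slaveId) = 0;
  virtual ~Scheduler() = default;
};

// A toy framework: launch one pending task per offered CPU, then decline.
class DemoScheduler : public Scheduler {
 public:
  std::vector<Task> resourceOffer(const std::string& offerId,
                                  const std::vector<Offer>& offers) override {
    std::vector<Task> tasks;
    for (const Offer& o : offers)
      for (int c = 0; c < o.cpus && pending_ > 0; ++c, --pending_)
        tasks.push_back({offerId + "-task-" + std::to_string(pending_), o.node});
    return tasks;  // anything not used is implicitly declined
  }
  void statusUpdate(const std::string& taskId, const std::string& status) override {
    std::cout << taskId << " -> " << status << "\n";
  }
  void slaveLost(const std::string& slaveId) override {
    std::cout << "lost slave " << slaveId << ", its tasks must be resubmitted\n";
  }
 private:
  int pending_ = 3;
};

int main() {
  DemoScheduler s;
  for (const Task& t : s.resourceOffer("offer-1", {{"node1", 2, 4}, {"node2", 3, 2}}))
    s.statusUpdate(t.taskId, "TASK_RUNNING");
  s.slaveLost("node2");
}
```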

34 Limitations from the Authors' Point of View
Fragmentation: heterogeneous resource demands can waste resources and starve large tasks.

