Mesos A Platform for Fine-Grained Resource Sharing in the Data Center Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony Joseph, Randy.

Slides:

Advertisements

Similar presentations

MapReduce: Simplified Data Processing on Large Cluster Jeffrey Dean and Sanjay Ghemawat OSDI 2004 Presented by Long Kai and Philbert Lin.

Advertisements

Ali Ghodsi UC Berkeley & KTH & SICS

The Datacenter Needs an Operating System Matei Zaharia, Benjamin Hindman, Andy Konwinski, Ali Ghodsi, Anthony Joseph, Randy Katz, Scott Shenker, Ion Stoica.

SLA-Oriented Resource Provisioning for Cloud Computing

Locality-Aware Dynamic VM Reconfiguration on MapReduce Clouds Jongse Park, Daewoo Lee, Bokyeong Kim, Jaehyuk Huh, Seungryoul Maeng.

A Hadoop Overview. Outline Progress Report MapReduce Programming Hadoop Cluster Overview HBase Overview Q & A.

UC Berkeley Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur *, Joydeep Sen Sarma *, Scott Shenker, Ion Stoica 1 RAD Lab, * Facebook Inc.

THE DATACENTER NEEDS AN OPERATING SYSTEM MATEI ZAHARIA, BENJAMIN HINDMAN, ANDY KONWINSKI, ALI GHODSI, ANTHONY JOSEPH, RANDY KATZ, SCOTT SHENKER, ION STOICA.

Spark: Cluster Computing with Working Sets

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica Spark Fast, Interactive,

Resource Management with YARN: YARN Past, Present and Future

Discretized Streams An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker,

Shark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Hive on Spark.

The "Almost Perfect" Paper -Christopher Francis Hyma Chilukiri.

Piccolo – Paper Discussion Big Data Reading Group 9/20/2010.

Why static is bad! Hadoop Pregel MPI Shared cluster Today: static partitioningWant dynamic sharing.

UC Berkeley Improving MapReduce Performance in Heterogeneous Environments Matei Zaharia, Andy Konwinski, Anthony Joseph, Randy Katz, Ion Stoica University.

Mesos A Platform for Fine-Grained Resource Sharing in Data Centers Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy.

Matei Zaharia, Dhruba Borthakur *, Joydeep Sen Sarma *, Khaled Elmeleegy +, Scott Shenker, Ion Stoica UC Berkeley, * Facebook Inc, + Yahoo! Research Delay.

Mesos A Platform for Fine-Grained Resource Sharing in the Data Center Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony Joseph, Randy.

UC Berkeley Improving MapReduce Performance in Heterogeneous Environments Matei Zaharia, Andy Konwinski, Anthony Joseph, Randy Katz, Ion Stoica University.

UC Berkeley Improving MapReduce Performance in Heterogeneous Environments Matei Zaharia, Andy Konwinski, Anthony Joseph, Randy Katz, Ion Stoica University.

Cluster Scheduler Reference: Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center NSDI’2011 Multi-agent Cluster Scheduling for Scalability.

Outline | Motivation| Design | Results| Status| Future

Next Generation of Apache Hadoop MapReduce Arun C. Murthy - Hortonworks Founder and Architect Formerly Architect, MapReduce.

Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc

A Platform for Fine-Grained Resource Sharing in the Data Center

CS162 Operating Systems and Systems Programming Lecture 24 Capstone: Cloud Computing April 29, 2013 Anthony D. Joseph

Benjamin Hindman Apache Mesos Design Decisions

1 CS : Project Suggestions Ion Stoica ( September 14, 2011.

UC Berkeley Spark A framework for iterative and interactive cluster computing Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica.

Cloud Computing Ali Ghodsi UC Berkeley, AMPLab

CS162 Operating Systems and Systems Programming Lecture 24 Capstone: Cloud Computing December 2, 2013 Anthony D. Joseph and John Canny

CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.

Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.

CloudClustering Ankur Dave*, Wei Lu†, Jared Jackson†, Roger Barga† *UC Berkeley †Microsoft Research Toward an Iterative Data Processing Pattern on the.

Tachyon: memory-speed data sharing Haoyuan (HY) Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, Ion Stoica Good morning everyone. My name is Haoyuan,

(Private) Cloud Computing with Mesos at Twitter Benjamin

Challenges towards Elastic Power Management in Internet Data Center.

Mesos A Platform for Fine-Grained Resource Sharing in the Data Center Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony Joseph, Randy.

Dominant Resource Fairness: Fair Allocation of Multiple Resource Types Ali Ghodsi, Matei Zaharia, Benjamin Hindman, Andy Konwinski, Scott Shenker, Ion.

Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!

Matei Zaharia, Dhruba Borthakur *, Joydeep Sen Sarma *, Khaled Elmeleegy +, Scott Shenker, Ion Stoica UC Berkeley, * Facebook Inc, + Yahoo! Research Fair.

Resilient Distributed Datasets: A Fault- Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,

CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.

DynamicMR: A Dynamic Slot Allocation Optimization Framework for MapReduce Clusters Nanyang Technological University Shanjiang Tang, Bu-Sung Lee, Bingsheng.

A N I N - MEMORY F RAMEWORK FOR E XTENDED M AP R EDUCE 2011 Third IEEE International Conference on Coud Computing Technology and Science.

Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies

A Platform for Fine-Grained Resource Sharing in the Data Center

Experiments in Utility Computing: Hadoop and Condor Sameer Paranjpye Y! Web Search.

Next Generation of Apache Hadoop MapReduce Owen

Presented by Qifan Pu With many slides from Ali’s NSDI talk Ali Ghodsi, Matei Zaharia, Benjamin Hindman, Andy Konwinski, Scott Shenker, Ion Stoica.

Apache Mesos What is it ? Beyond Hadoop Resource Sharing Mesos Intentions Architecture Users

Dominant Resource Fairness: Fair Allocation of Multiple Resource Types Ali Ghodsi, Matei Zaharia, Benjamin Hindman, Andy Konwinski, Scott Shenker, Ion.

Part III BigData Analysis Tools (YARN) Yuan Xue

Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center NSDI 11’ Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D.

Resilient Distributed Datasets A Fault-Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,

COS 518: Advanced Computer Systems Lecture 13 Michael Freedman

Organizations Are Embracing New Opportunities

Introduction to Distributed Platforms

Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center

Hadoop Clusters Tess Fulkerson.

Apache Spark Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing Aditya Waghaye October 3, 2016 CS848 – University.

Introduction to Spark.

Sajitha Naduvil-vadukootu

EECS 582 Final Review Mosharaf Chowdhury EECS 582 – F16.

Omega: flexible, scalable schedulers for large compute clusters

COS 518: Advanced Computer Systems Lecture 14 Michael Freedman

Cloud Computing Large-scale Resource Management

Cloud Computing MapReduce in Heterogeneous Environments

Presentation transcript:

Mesos A Platform for Fine-Grained Resource Sharing in the Data Center Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony Joseph, Randy Katz, Scott Shenker, Ion Stoica University of California, Berkeley Presented by: Aditi Bose, Balasaheb Bagul

Pig Background Rapid innovation in cluster computing frameworks Dryad Pregel Percolator C IEL

Problem Rapid innovation in cluster computing frameworks No single framework optimal for all applications Want to run multiple frameworks in a single cluster »…to maximize utilization »…to share data between frameworks

Where We Want to Go Hadoop Pregel MPI Shared cluster Today: static partitioningMesos: dynamic sharing

Solution Mesos is a common resource sharing layer over which diverse frameworks can run Mesos Node Hadoop Pregel … Node Hadoop Node Pregel …

Other Benefits of Mesos Run multiple instances of the same framework »Isolate production and experimental jobs »Run multiple versions of the framework concurrently Build specialized frameworks targeting particular problem domains »Better performance than general-purpose abstractions

Outline Mesos Goals and Architecture Implementation Results Related Work

Mesos Goals High utilization of resources Support diverse frameworks (current & future) Scalability to 10,000’s of nodes Reliability in face of failures Resulting design: Small microkernel-like core that pushes scheduling logic to frameworks

Design Decision Push control to frameworks Benefits: »Implement diverse approaches to various problems »Fine grained sharing »Keep Mesos simple, scalable & robust Implementation: »Resource offers

Fine-Grained Sharing Framework 1 Framework 2 Framework 3 Coarse-Grained Sharing (HPC):Fine-Grained Sharing (Mesos): + Improved utilization, responsiveness Storage System (e.g. HDFS) Fw. 1 Fw. 3 Fw. 2 Fw. 1 Fw. 3 Fw. 2 Fw. 3 Fw. 1 Fw. 2 Fw. 1 Fw. 3 Fw. 2

Mesos Architecture MPI job MPI scheduler Hadoop job Hadoop scheduler Allocation module Mesos master Mesos slave MPI executor Mesos slave MPI executor task Resource offer Pick framework to offer resources to

Mesos Architecture MPI job MPI scheduler Hadoop job Hadoop scheduler Allocation module Mesos master Mesos slave MPI executor Mesos slave MPI executor task Pick framework to offer resources to Resource offer Resource offer = list of (node, availableResources) E.g. { (node1, ), (node2, ) } Resource offer = list of (node, availableResources) E.g. { (node1, ), (node2, ) }

Mesos Architecture MPI job MPI scheduler Hadoop job Hadoop scheduler Allocation module Mesos master Mesos slave MPI executor Hadoop executor Mesos slave MPI executor task Pick framework to offer resources to task Framework-specific scheduling Resource offer Launches and isolates executors

Resource Allocation »Allocation by a pluggable allocation module »Reallocates when tasks finish »Guaranteed allocation to each framework »Trigger revocation if exceeds guaranteed allocation »Grace period to framework before revocation

Isolation »Leverages existing OS isolation mechanisms »Pluggable isolation modules »Currently isolate using OS container technologies

Resource Sharing Includes three mechanisms for this goal »First, Provide filters to the master »Second, Count resources offered as allocated »Third, if framework doesn’t respond fast, re-offer resources to other frameworks.

Fault Tolerance Takes care of the following failure scenarios »Master Failure »Node Failure »Scheduler Failure

Implementation

Implementation Stats 20,000 lines of C++ Master failover using ZooKeeper Frameworks ported: Hadoop, MPI, Torque New specialized framework: Spark, for iterative jobs (up to 20× faster than Hadoop) Open source in Apache Incubator

Users Twitter uses Mesos on > 100 nodes to run ~12 production services (mostly stream processing) Berkeley machine learning researchers are running several algorithms at scale on Spark Conviva is using Spark for data analytics UCSF medical researchers are using Mesos to run Hadoop and eventually non-Hadoop apps

Results »Utilization and performance vs static partitioning »Framework placement goals: data locality »Scalability »Fault recovery

Dynamic Resource Sharing

Mesos vs Static Partitioning Compared performance with statically partitioned cluster where each framework gets 25% of nodes FrameworkSpeedup on Mesos Facebook Hadoop Mix1.14× Large Hadoop Mix2.10× Spark1.26× Torque / MPI0.96×

Ran 16 instances of Hadoop on a shared HDFS cluster Used delay scheduling [EuroSys ’10] in Hadoop to get locality (wait a short time to acquire data-local nodes) Data Locality with Resource Offers 1.7×

Scalability Scaled to 50,000 emulated slaves, 200 frameworks, 100K tasks (30s len)

Fault Tolerance Mesos master has only soft state: list of currently running frameworks and tasks Rebuild when frameworks and slaves re-register with new master after a failure Result: fault detection and recovery in ~10 sec

Related Work HPC & Grid schedulers (e.g. Torque, LSF, Sun Grid Engine) »Coarse-grained sharing for inelastic jobs (e.g. MPI) Virtual machine clouds »Coarse-grained sharing similar to HPC Condor »Centralized scheduler based on matchmaking Parallel work: Next-Generation Hadoop »Redesign of Hadoop to have per-application masters »Also aims to support non-MapReduce jobs

Conclusion Mesos shares clusters efficiently among diverse frameworks thanks to two design elements: »Fine-grained sharing at the level of tasks »Distributed Scheduling mechanism called as Resource offers, that delegates scheduling decisions to frameworks. Enables co-existence of current frameworks and development of new specialized ones In use at Twitter, UC Berkeley, Conviva and UCSF

Thank You

Framework Isolation Mesos uses OS isolation mechanisms, such as Linux containers and Solaris projects Containers currently support CPU, memory, IO and network bandwidth isolation Not perfect, but much better than no isolation

Analysis Resource offers work well when: »Frameworks can scale up and down elastically »Task durations are homogeneous »Frameworks have many preferred nodes These conditions hold in current data analytics frameworks (MapReduce, Dryad, …) »Work divided into short tasks to facilitate load balancing and fault recovery »Data replicated across multiple nodes

Revocation Mesos allocation modules can revoke (kill) tasks to meet organizational SLOs Framework given a grace period to clean up “Guaranteed share” API lets frameworks avoid revocation by staying below a certain share

Mesos API Scheduler Callbacks resourceOffer(offerId, offers) offerRescinded(offerId) statusUpdate(taskId, status) slaveLost(slaveId) Executor Callbacks launchTask(taskDescriptor) killTask(taskId) Executor Actions sendStatus(taskId, status) Scheduler Actions replyToOffer(offerId, tasks) setNeedsOffers(bool) setFilters(filters) getGuaranteedShare() killTask(taskId)