A Platform for Fine-Grained Resource Sharing in the Data Center


Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony Joseph, Randy Katz, Scott Shenker, Ion Stoica
Presented by Arka Bhattacharya. Slides adapted from the NSDI 2011 talk.

Background
Rapid innovation in cluster computing frameworks: Pregel, Dryad, Pig, MPI, Percolator (an incremental web-indexing system).

Problem
Rapid innovation in cluster computing frameworks, but no single framework is best for all applications. We want to run multiple frameworks in a single cluster:
…to maximize utilization
…to share data between frameworks that use the same dataset

This paper is basically about: static partitioning (today) vs. dynamic sharing (Mesos)
Today, frameworks like Hadoop, Pregel, and MPI each get a statically partitioned slice of the cluster. Static partitioning is bad because their demand peaks at different points in time, so a shared cluster gives good statistical multiplexing: jobs finish faster because they can scale up, and you can run more workload on the nodes you already have.

Other Benefits of Mesos
Run multiple instances of the same framework: isolate production and experimental jobs, or run multiple versions concurrently to deploy upgrades.
Build specialized frameworks to experiment with other programming models, etc.
Simplify cluster management.

Solution
Mesos is a common resource-sharing layer over which diverse frameworks can run.
[Diagram: frameworks such as Hadoop and Pregel run on top of Mesos, which spans all the nodes in the cluster.]

Mesos Goals
High utilization of resources
Support diverse frameworks
Scalability to 10,000s of nodes
Reliability in the face of failures
The Mesos code is small and simple; the majority of scheduling is left to the frameworks. Resulting design: a small, microkernel-like core that pushes scheduling logic to frameworks.

Design Elements
Fine-grained sharing: allocation at the level of tasks within a job. Improves utilization, latency, and data locality.
Resource offers: a simple, scalable, application-controlled scheduling mechanism.

Element 1: Fine-Grained Sharing
[Diagram: under coarse-grained sharing (HPC), each framework holds whole nodes for long periods; under fine-grained sharing (Mesos), frameworks 1–3 share nodes in time and space, with their tasks interleaved across nodes on top of a shared storage system (e.g. HDFS).]
+ Improved utilization, responsiveness, data locality

Element 2: Resource Offers
Option: a global scheduler. Frameworks express their needs in a specification language, and the global scheduler matches them to resources.
+ Can (potentially) make optimal decisions, given all the information
– Complex: the language must support all framework needs
– Difficult to make robust: future frameworks may have unanticipated needs

Element 2: Resource Offers
Mesos's approach: resource offers. Offer available resources to frameworks, and let them pick which resources to use and which tasks to launch.
+ Keeps Mesos simple and makes it possible to support future frameworks
– Decentralized decisions might not be optimal
A simple analogy for a resource offer: airplane seat reservation, where you are offered the available seats and choose among them.

Mesos Architecture
[Diagram: MPI and Hadoop jobs submit to their own framework schedulers; the Mesos master's allocation module picks which framework to offer resources to; Mesos slaves launch and isolate framework executors, which run the tasks.]
A resource offer is a list of (node, availableResources) pairs, e.g. { (node1, <2 CPUs, 4 GB>), (node2, <3 CPUs, 2 GB>) }.
Note the separation of inter-framework scheduling (the master's allocation module) and intra-framework scheduling (each framework's own scheduler).
How does a framework know when to accept and reject offers? Wait a little!
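The offer mechanism can be sketched as a toy simulation (this is not the real Mesos API; the class and method names are hypothetical): the master hands a framework scheduler a list of (node, availableResources) pairs, and the scheduler decides which resources to accept and which tasks to launch on them.

```python
# Toy sketch of Mesos-style resource offers (simplified; not the real Mesos API).
# An offer is a list of (node, available_resources) pairs.

class HadoopScheduler:
    """Hypothetical framework scheduler: accepts any offered node with
    at least 1 CPU and 1 GB free, launching one map task per such node."""
    def resource_offers(self, offers):
        launched = []
        for node, res in offers:
            if res["cpus"] >= 1 and res["mem_gb"] >= 1:
                launched.append((node, "map-task"))
        return launched  # resources not used here go back to the master

offers = [("node1", {"cpus": 2, "mem_gb": 4}),
          ("node2", {"cpus": 0, "mem_gb": 2})]
print(HadoopScheduler().resource_offers(offers))
# → [('node1', 'map-task')]  — node2 has no free CPU, so it is declined
```

Declined resources are simply re-offered to other frameworks, which is what keeps the master's job so small.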

Optimization: Filters Let frameworks short-circuit rejection by providing a predicate on resources to be offered E.g. “nodes from list L” or “nodes with > 8 GB RAM” Could generalize to other hints as well Ability to reject still ensures correctness when needs cannot be expressed using filters
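A filter of this kind is just a predicate over (node, resources) that the master evaluates before making an offer. A minimal sketch (the helper name and resource representation are assumptions, not the real Mesos filter API):

```python
# Sketch of offer filters: a framework registers a predicate, and the
# master skips offering resources that fail it, short-circuiting the
# offer/reject round trip. Simplified; not the real Mesos filter API.

def make_filter(min_mem_gb=0, allowed_nodes=None):
    """Build predicates like "nodes with > 8 GB RAM" or "nodes from list L"."""
    def accept(node, res):
        if res["mem_gb"] < min_mem_gb:
            return False
        if allowed_nodes is not None and node not in allowed_nodes:
            return False
        return True
    return accept

cluster = {"node1": {"mem_gb": 16}, "node2": {"mem_gb": 4}}
big_mem_only = make_filter(min_mem_gb=8)
offerable = [n for n, r in cluster.items() if big_mem_only(n, r)]
print(offerable)  # → ['node1']
```

Because the framework can still reject any offer that does slip through, a filter only needs to be a hint, not an exact specification of the framework's needs.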

Framework Isolation
Mesos uses OS isolation mechanisms, such as Linux containers and Solaris projects. Containers currently support CPU, memory, I/O, and network bandwidth isolation. Not perfect, but much better than no isolation.

Results
Utilization and performance vs. static partitioning
Framework placement goals: data locality
Scalability
Fault recovery

Dynamic Resource Sharing
[Graph: the frameworks' resource shares on the shared cluster changing dynamically over time.] Also note how quickly the frameworks ramp up.

Mesos vs Static Partitioning
Compared performance with a statically partitioned cluster where each framework gets 25% of the nodes.

Framework            Speedup on Mesos
Facebook Hadoop Mix  1.14×
Large Hadoop Mix     2.10×
Spark                1.26×
Torque / MPI         0.96×

Data Locality with Resource Offers
Ran 16 instances of Hadoop on a shared HDFS cluster. Used delay scheduling [EuroSys '10] in Hadoop to get locality (wait a short time to acquire data-local nodes). Result: a 1.7× improvement.
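The delay-scheduling idea composes naturally with resource offers: when an offered node does not hold the task's input data, decline the offer and wait a bounded number of rounds for a data-local one. A minimal sketch (function name, job representation, and the round-counting delay are simplifying assumptions; the real algorithm uses wait times):

```python
# Sketch of delay scheduling [EuroSys '10], simplified: decline offers of
# non-local nodes for up to `max_delay` scheduling rounds, then give up
# on locality and launch remotely.

def delay_schedule(job, offered_node, max_delay=5):
    """job: dict with 'local_nodes' (nodes holding its input data)
    and 'skipped' (offers already declined while waiting)."""
    if offered_node in job["local_nodes"]:
        job["skipped"] = 0
        return "launch-local"
    if job["skipped"] < max_delay:
        job["skipped"] += 1
        return "wait"          # decline; hope a data-local node is offered soon
    job["skipped"] = 0
    return "launch-remote"     # waited long enough; accept a non-local node

job = {"local_nodes": {"node3"}, "skipped": 0}
print(delay_schedule(job, "node1"))  # → wait
print(delay_schedule(job, "node3"))  # → launch-local
```

Because tasks are short and nodes free up frequently, a small delay is usually enough to find a local slot, which is where the 1.7× locality gain comes from.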

Scalability
Mesos only performs inter-framework scheduling (e.g. fair sharing), which is easier than intra-framework scheduling.
Result: scaled to 50,000 emulated slaves, 200 frameworks, and 100K tasks (30-second task length).
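To see why inter-framework scheduling is the easy part: the allocation module only has to decide which framework gets the next offer. A fair-sharing policy can be sketched in a few lines (heavily simplified; Mesos's allocation modules are pluggable, and fair sharing over multiple resource types uses DRF):

```python
# Sketch of an allocation module doing inter-framework fair sharing:
# offer freed resources to the framework currently using the least.
# Simplified to a single resource (CPUs); real policies use DRF.

def pick_framework(allocated_cpus):
    """allocated_cpus: framework name -> CPUs currently allocated.
    The framework with the fewest allocated CPUs gets the next offer."""
    return min(allocated_cpus, key=allocated_cpus.get)

usage = {"hadoop": 40, "mpi": 10, "spark": 25}
print(pick_framework(usage))  # → mpi
```

All the hard, framework-specific decisions (which node, which task, locality) stay inside the frameworks, which is what lets the master scale to tens of thousands of slaves.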

Mesos Now

Thoughts
Resource offers vs. resource requests: an unnecessary complication? Was it driven by the desire to change frameworks minimally?
Fragmentation and fairness: the DRF paper shows that task sizes can be large, hence large wastage. The fairness reasoning is slightly messy when resources are rejected by a framework.
How different are boolean "filters" from resource requests?
Is two-level scheduling used by current data centers? Mesosphere does not seem to need it for most of its use cases.