Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling
Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Khaled Elmeleegy+, Scott Shenker, Ion Stoica
UC Berkeley, *Facebook Inc, +Yahoo! Research
Motivation
MapReduce / Hadoop originally designed for high-throughput batch processing
Today's workload is far more diverse:
– Many users want to share a cluster: engineering, marketing, business intelligence, etc.
– Vast majority of jobs are short: ad-hoc queries, sampling, periodic reports
– Response time is critical: interactive queries, deadline-driven reports
How can we efficiently share MapReduce clusters between users?
Example: Hadoop at Facebook
600-node, 2 PB data warehouse, growing at 15 TB/day
Applications: data mining, spam detection, ads
200 users (half non-engineers)
7500 MapReduce jobs / day
Median job length: 84 s
Approaches to Sharing
Hadoop default scheduler (FIFO)
– Problem: short jobs get stuck behind long ones
Separate clusters
– Problem 1: poor utilization
– Problem 2: costly data replication
  Full replication across clusters is nearly infeasible at Facebook/Yahoo! scale
  Partial replication prevents cross-dataset queries
Our Work
Hadoop Fair Scheduler
– Fine-grained sharing at the level of map & reduce tasks
– Predictable response times and user isolation
Main challenge: data locality
– For efficiency, must run tasks near their input data
– Strictly following any job queuing policy hurts locality: the job picked by the policy may not have data on the free nodes
Solution: delay scheduling
– Relax the queuing policy for a limited time to achieve locality
The Problem
[Diagram: master holds the scheduling order for Job 1 and Job 2; slave nodes run tasks on blocks of File 1 and File 2 spread across the cluster]
Problem: Fair decision hurts locality
Especially bad for jobs with small input files
Locality vs. Job Size at Facebook
[Chart: data locality achieved vs. job input size; callout: 58% of jobs]
Special Instance: Sticky Slots
Under fair sharing, locality can be poor even when all jobs have large input files
Problem: jobs get "stuck" in the same set of task slots
– When a task in job j finishes, the slot it was running in is given back to j, because j is below its share
– Bad because data files are spread out across all nodes
Solution: Delay Scheduling
Relax the queuing policy to make jobs wait for a limited time if they cannot launch local tasks
Result: a very short wait time (1–5 s) is enough to get nearly 100% locality
Delay Scheduling Example
Idea: Wait a short time to get data-local scheduling opportunities
[Diagram: scheduling order for Job 1 and Job 2; a job waits ("Wait!") instead of launching a non-local task, then launches a node-local task when a slot with its data frees up]
Delay Scheduling Details
Scan jobs in the order given by the queuing policy, picking the first that is permitted to launch a task
Jobs must wait before being permitted to launch non-local tasks
– If wait < T1, only allow node-local tasks
– If T1 < wait < T2, also allow rack-local
– If wait > T2, also allow off-rack
Increase a job's time waited when it is skipped
Analysis
When is it worth it to wait, and for how long?
Waiting is worth it if E(wait time) < cost to run non-locally
Simple model: data-local scheduling opportunities arrive following a Poisson process
Expected response time gain from delay scheduling:
E(gain) = (1 − e^(−w/t))(d − t)
where w = wait amount, t = expected time to get a local scheduling opportunity, d = delay from running non-locally
Optimal wait time is infinity
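A brief sketch of where this expression comes from, under the slide's Poisson assumption (t is the mean time between local scheduling opportunities; d − t is the net saving when a local slot does appear):

% Sketch of the gain formula, assuming local scheduling opportunities
% arrive as a Poisson process with mean inter-arrival time t.
\begin{align*}
\Pr[\text{local slot frees within wait } w] &= 1 - e^{-w/t}\\
\text{net saving when one arrives} &\approx d - t
  \quad\text{(avoid non-local delay $d$, pay $\approx t$ of waiting)}\\
E(\text{gain}) &\approx \bigl(1 - e^{-w/t}\bigr)(d - t)
\end{align*}

Since the gain only grows with w (when d > t), the model alone says to wait forever; the next slide shows why, in practice, a short wait already captures nearly all of the gain.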
Analysis Continued
E(gain) = (1 − e^(−w/t))(d − t)
where w = wait amount, t = expected time to get a local scheduling opportunity, d = delay from running non-locally
Typical values of t and d:
t = (avg task length) / (file replication × tasks per node)
d ≥ (avg task length)
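A quick numeric illustration of why a short wait suffices, plugging in hypothetical but representative numbers (the task length, replication factor, and slots per node below are assumptions for illustration, not figures from the talk):

import math

# Hypothetical values: 60 s average task, 3x replication, 6 task slots per node.
avg_task_len = 60.0
replication = 3
slots_per_node = 6

t = avg_task_len / (replication * slots_per_node)  # expected time to a local opportunity (~3.3 s)
d = avg_task_len                                    # delay from running non-locally (at least one task length)

for w in [1, 5, 10]:  # candidate wait times in seconds
    p_local = 1 - math.exp(-w / t)  # chance a local slot frees up within the wait
    gain = p_local * (d - t)        # expected response-time gain from waiting
    print(f"wait {w:>2} s: P(local) = {p_local:.2f}, expected gain = {gain:.1f} s")

With these numbers, a 5 s wait already gives roughly a 78% chance of a local launch, consistent with the 1–5 s result quoted earlier.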
More Analysis
What if some nodes are occupied by long tasks?
Suppose we have:
– A fraction φ of active tasks that are long
– R replicas of each block
– L task slots per node
For a given data block, P(all replicas blocked by long tasks) = φ^(RL)
With R = 3 and L = 6, this is less than 2% if φ < 0.8
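A one-line check of that bound (the symbol φ for the long-task fraction is reconstructed; the slide's original glyph did not survive extraction):

# Probability that all R*L slots holding a block's replicas are busy with long tasks.
R, L = 3, 6  # replicas per block, task slots per node
for frac_long in (0.5, 0.7, 0.8):
    p_blocked = frac_long ** (R * L)
    print(f"long-task fraction {frac_long}: P(block fully blocked) = {p_blocked:.3f}")
# At 0.8 this is about 0.018, i.e. under 2%, matching the slide's claim.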
Evaluation
Macrobenchmark
– IO-heavy workload
– CPU-heavy workload
– Mixed workload
Microbenchmarks
– Sticky slots
– Small jobs
– Hierarchical fair scheduling
Sensitivity analysis
Scheduler overhead
Macrobenchmark
100-node EC2 cluster, 4 cores/node
Job submission schedule based on job sizes and inter-arrival times at Facebook
– 100 jobs grouped into 9 "bins" of sizes
Three workloads: IO-heavy, CPU-heavy, mixed
Three schedulers: FIFO; fair sharing; fair + delay scheduling (wait time = 5 s)
Results for IO-Heavy Workload
Small jobs (1-10 input blocks), medium jobs (… input blocks), large jobs (4800 input blocks)
[Chart: job response times; callouts: up to 5x, 40%]
Results for IO-Heavy Workload
Sticky Slots Microbenchmark
5-50 jobs on EC2
100-node cluster, 4 cores/node
5 s delay scheduling
[Chart: callout: 2x]
Conclusions
Delay scheduling works well under two conditions:
– A sufficient fraction of tasks are short relative to jobs
– There are many locations where a task can run efficiently (blocks replicated across nodes, multiple tasks/node)
Generalizes beyond MapReduce and fair sharing:
– Applies to any queuing policy that gives an ordered job list; used in the Hadoop Fair Scheduler to implement a hierarchical policy
– Applies to placement needs other than data locality
Thanks!
This work is open source, packaged with Hadoop as the Hadoop Fair Scheduler
– Old version without delay scheduling in use at Facebook since fall 2008
– Delay scheduling will appear in Hadoop 0.21
Related Work
Quincy (SOSP '09)
– Locality-aware fair scheduler for Dryad
– Expresses the scheduling problem as min-cost flow
– Kills tasks to reassign nodes when jobs change
– One task slot per machine
HPC scheduling
– Inelastic jobs (cannot scale up/down over time)
– Data locality generally not considered
Grid computing
– Focused on common interfaces for accessing compute resources in multiple administrative domains
Delay Scheduling Details
when there is a free task slot on node n:
  sort jobs according to queuing policy
  for j in jobs:
    if j has node-local task t on n:
      j.level := 0; j.wait := 0; return t
    else if j has rack-local task t on n and (j.level ≥ 1 or j.wait ≥ T1):
      j.level := 1; j.wait := 0; return t
    else if j.level = 2 or (j.level = 1 and j.wait ≥ T2)
            or (j.level = 0 and j.wait ≥ T1 + T2):
      j.level := 2; j.wait := 0; return t
    else:
      j.wait += time since last scheduling decision
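A minimal runnable sketch of the pseudocode above in Python, assuming a toy Job class and a caller that supplies the elapsed time since the last scheduling decision; the names, thresholds, and locality checks are illustrative, not the Hadoop Fair Scheduler's actual API:

T1, T2 = 5.0, 10.0  # illustrative wait thresholds (seconds) for rack-local and off-rack launches

class Job:
    def __init__(self, name):
        self.name = name
        self.level = 0   # locality level already earned: 0 = node-local only, 1 = rack-local, 2 = any
        self.wait = 0.0  # time this job has waited since it last launched a task

    # Placeholder locality checks; a real scheduler would consult block locations.
    def node_local_task(self, node): ...
    def rack_local_task(self, node): ...
    def any_task(self): ...

def assign_task(jobs, node, elapsed):
    """Pick a task for a free slot on `node`; `jobs` is already sorted by the queuing policy."""
    for j in jobs:
        t = j.node_local_task(node)
        if t:
            j.level, j.wait = 0, 0.0
            return t
        t = j.rack_local_task(node)
        if t and (j.level >= 1 or j.wait >= T1):
            j.level, j.wait = 1, 0.0
            return t
        t = j.any_task()
        if t and (j.level == 2 or (j.level == 1 and j.wait >= T2)
                  or (j.level == 0 and j.wait >= T1 + T2)):
            j.level, j.wait = 2, 0.0
            return t
        # Job was skipped: credit the time since the last scheduling decision.
        j.wait += elapsed
    return None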
Sensitivity Analysis