The Power of Choice in Data-Aware Cluster Scheduling

Slides:

Advertisements

Similar presentations

Starfish: A Self-tuning System for Big Data Analytics.

Advertisements

MapReduce Online Tyson Condie UC Berkeley Slides by Kaixiang MO

Aggressive Cloning of Jobs for Effective Straggler Mitigation Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, Ion Stoica.

Effective Straggler Mitigation: Attack of the Clones Ganesh Ananthanarayanan, Ali Ghodsi, Srikanth Kandula, Scott Shenker, Ion Stoica.

Scheduling in Distributed Systems Gurmeet Singh CS 599 Lecture.

GRASS: Trimming Stragglers in Approximation Analytics Ganesh Ananthanarayanan, Michael Hung, Xiaoqi Ren, Ion Stoica, Adam Wierman, Minlan Yu.

Effective Straggler Mitigation: Attack of the Clones [1]

MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.

Locality-Aware Dynamic VM Reconfiguration on MapReduce Clouds Jongse Park, Daewoo Lee, Bokyeong Kim, Jaehyuk Huh, Seungryoul Maeng.

UC Berkeley Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur *, Joydeep Sen Sarma *, Scott Shenker, Ion Stoica 1 RAD Lab, * Facebook Inc.

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica Spark Fast, Interactive,

Enabling High-level SLOs on Shared Storage Andrew Wang, Shivaram Venkataraman, Sara Alspaugh, Randy Katz, Ion Stoica Cake 1.

Discretized Streams An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker,

Disk-Locality in Datacenter Computing Considered Irrelevant Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, Ion Stoica 1.

Making Sense of Performance in Data Analytics Frameworks Kay Ousterhout, Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, Byung-Gon Chun.

Making Sense of Spark Performance

Scaling Distributed Machine Learning with the BASED ON THE PAPER AND PRESENTATION: SCALING DISTRIBUTED MACHINE LEARNING WITH THE PARAMETER SERVER – GOOGLE,

UC Berkeley Improving MapReduce Performance in Heterogeneous Environments Matei Zaharia, Andy Konwinski, Anthony Joseph, Randy Katz, Ion Stoica University.

Mesos A Platform for Fine-Grained Resource Sharing in Data Centers Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy.

Matei Zaharia, Dhruba Borthakur *, Joydeep Sen Sarma *, Khaled Elmeleegy +, Scott Shenker, Ion Stoica UC Berkeley, * Facebook Inc, + Yahoo! Research Delay.

UC Berkeley Improving MapReduce Performance in Heterogeneous Environments Matei Zaharia, Andy Konwinski, Anthony Joseph, Randy Katz, Ion Stoica University.

Adaptive Stream Processing using Dynamic Batch Sizing Tathagata Das, Yuan Zhong, Ion Stoica, Scott Shenker.

Making Sense of Performance in Data Analytics Frameworks Kay Ousterhout Joint work with Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, Byung-Gon Chun UC.

UC Berkeley Improving MapReduce Performance in Heterogeneous Environments Matei Zaharia, Andy Konwinski, Anthony Joseph, Randy Katz, Ion Stoica University.

Distributed Process Management1 Learning Objectives Distributed Scheduling Algorithms Coordinator Elections Orphan Processes.

Distributed Iterative Training Kevin Gimpel Shay Cohen Severin Hacker Noah A. Smith.

The Case for Tiny Tasks in Compute Clusters Kay Ousterhout *, Aurojit Panda *, Joshua Rosen *, Shivaram Venkataraman *, Reynold Xin *, Sylvia Ratnasamy.

Distributed Low-Latency Scheduling

Optimizing Cloud MapReduce for Processing Stream Data using Pipelining 作者 :Rutvik Karve ， Devendra Dahiphale ， Amit Chhajer 報告 : 饒展榕.

The Limitation of MapReduce: A Probing Case and a Lightweight Solution Zhiqiang Ma Lin Gu Department of Computer Science and Engineering The Hong Kong.

Low Latency Geo-distributed Data Analytics Qifan Pu, Ganesh Ananthanarayanan, Peter Bodik, Srikanth Kandula, Aditya Akella, Paramvir Bahl, Ion Stoica.

Reining in the Outliers in Map-Reduce Clusters using Mantri Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha,

MC 2 : Map Concurrency Characterization for MapReduce on the Cloud Mohammad Hammoud and Majd Sakr 1.

Matei Zaharia, Dhruba Borthakur *, Joydeep Sen Sarma *, Khaled Elmeleegy +, Scott Shenker, Ion Stoica UC Berkeley, * Facebook Inc, + Yahoo! Research Fair.

Sparrow Distributed Low-Latency Spark Scheduling Kay Ousterhout, Patrick Wendell, Matei Zaharia, Ion Stoica.

Rassul Ayani 1 Performance of parallel and distributed systems  What is the purpose of measurement?  To evaluate a system (or an architecture)  To compare.

Network-Aware Scheduling for Data-Parallel Jobs: Plan When You Can

CHAPTER 8 CAPACITY. THE CONCEPT Maximum rate of output for a process Inadequate capacity can lose customers and limit growth while excess capacity can.

DynamicMR: A Dynamic Slot Allocation Optimization Framework for MapReduce Clusters Nanyang Technological University Shanjiang Tang, Bu-Sung Lee, Bingsheng.

Minimizing Churn in Distributed Systems P. Brighten Godfrey, Scott Shenker, and Ion Stoica UC Berkeley SIGCOMM’06.

Scalable and Coordinated Scheduling for Cloud-Scale computing

1 Adaptive Parallelism for Web Search Myeongjae Jeon Rice University In collaboration with Yuxiong He (MSR), Sameh Elnikety (MSR), Alan L. Cox (Rice),

ApproxHadoop Bringing Approximations to MapReduce Frameworks

6.888 Lecture 8: Networking for Data Analytics Mohammad Alizadeh Spring  Many thanks to Mosharaf Chowdhury (Michigan) and Kay Ousterhout (Berkeley)

1 VLDB, Background What is important for the user.

PACMan: Coordinated Memory Caching for Parallel Jobs Ganesh Ananthanarayanan, Ali Ghodsi, Andrew Wang, Dhruba Borthakur, Srikanth Kandula, Scott Shenker,

BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data Authored by Sameer Agarwal, et. al. Presented by Atul Sandur.

Spark on Entropy : A Reliable & Efficient Scheduler for Low-latency Parallel Jobs in Heterogeneous Cloud Huankai Chen PhD Student at University of Kent.

Aalo Efficient Coflow Scheduling Without Prior Knowledge Mosharaf Chowdhury, Ion Stoica UC Berkeley.

Resilient Distributed Datasets A Fault-Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,

Re-Architecting Apache Spark for Performance Understandability Kay Ousterhout Joint work with Christopher Canel, Max Wolffe, Sylvia Ratnasamy, Scott Shenker.

GRASS: Trimming Stragglers in Approximation Analytics

Packing Tasks with Dependencies

How Alluxio (formerly Tachyon) brings a 300x performance improvement to Qunar’s streaming processing Xueyan Li (Qunar) & Chunming Li (Garena)

Chris Cai, Shayan Saeed, Indranil Gupta, Roy Campbell, Franck Le

Managing Data Transfer in Computer Clusters with Orchestra

Resource Elasticity for Large-Scale Machine Learning

PA an Coordinated Memory Caching for Parallel Jobs

EECS 582 Final Review Mosharaf Chowdhury EECS 582 – F16.

湖南大学-信息科学与工程学院-计算机与科学系

Omega: flexible, scalable schedulers for large compute clusters

Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing Zaharia, et al (2012)

Resource-Efficient and QoS-Aware Cluster Management

The use of Neural Networks to schedule flow-shop with dynamic job arrival ‘A Multi-Neural Network Learning for lot Sizing and Sequencing on a Flow-Shop’

CherryPick: Adaptively Unearthing the Best

Fast, Interactive, Language-Integrated Cluster Computing

COS 518: Distributed Systems Lecture 11 Mike Freedman

EdgeWise: A Better Stream Processing Engine for the Edge

Supporting Online Analytics with User-Defined Estimation and Early Termination in a MapReduce-Like Framework Yi Wang, Linchuan Chen, Gagan Agrawal The.

Presentation transcript:

The Power of Choice in Data-Aware Cluster Scheduling Shivaram Venkataraman1 , Aurojit Panda1 Ganesh Ananthanarayanan2, Michael Franklin1, Ion Stoica1 1UC Berkeley, 2Microsoft Research Present by Qi Wang Borrow some slides from the author

Trends: Big Data Data grows faster than Moore’s Law! [Kathy Yelick LBNL, VLDBJ 2012, Dhruba Borthakur]

Trends: Low Latency 10 min 1 min 10s 2s 2004: MapReduce batch job 2009: Hive 2010: Dremel 2012: In-memory Spark

? Big Data or Low Latency SQL Query : 2.5 TB on 100 machines < 10s > 15 minutes 1 - 5 Minutes

Trends: Sampling Use a subset of the input data

Applications Approximate Query Processing Machine learning algorithms blinkdb presto minitable Machine learning algorithms stochastic gradient coordinate descent A key property of such jobs is that they can compute on any of the combinatorially many subsets of the input dataset without compromising application correct- ness.

Combinatorial Choices Any K N Sampling -> Smaller inputs + Choice

Existing scheduler Available (N) = 2 Required (K) = 2 1 2 3 4 Rack Time Available Data Running Busy Unavailable Data 11

Choice-aware scheduler Available (N) = 4 Required (K) = 2 1 2 Rack 3 4 Time Available Data Running Busy

Data-aware scheduling Input stage Less than 60% of tasks achieve locality even with three replica Tasks have to be scheduled with memory locality Intermediate stages The runtime is dictated by the amount of data transferred across racks To schedule the task at a machine that minimizes the time it takes to transfer all remote inputs

= Cross-rack skew 6 2 = 3 Core Bottleneck Link Link with Max. transfers 2 2 4 2 6 4 Cross Rack Data Skew M4 R2 M5 R3 M1 M2 Maximum Minimum transfers M3 R1 6 2 = = 3

KMN Scheduler A scheduling framework Input Stage Intermediate Stage Exploits the available choices to improve performance Choose the subset of data dynamically increase locality for input Stages balance network usage for intermediate stages Input Stage Intermediate Stage

Input stage Jobs which can use any K of the N input blocks Tasks read data from distributed storage Main idea: leverage the combinatorial choice of inputs

Input stage Jobs which use a custom sampling functions

Intermediate stage Challenges Main idea Scheduling a few additional tasks at upstream stage Schedule M tasks for an upstream stage with K tasks They consume all the outputs produced by upstream tasks. Thus, they have no choice in picking their inputs. We see that running a few extra tasks is an effective strategy to reduce skew.

Schedule additional tasks Available (N) = 4 Launched (M) = 3 Required (K) = 2 1 2 Rack 3 4 Time Available Data Running Busy

Select best upstream outputs Round-robin strategy Spread choice of K outputs across as many racks as possible M M M K = 5 cross-rack skew = 3

Select best upstream outputs After running extra tasks M M M M M M = 7 K = 5 cross-rack skew = 3

Select best upstream outputs M M M M M = 7 K = 5 cross-rack skew = 2

Handling Stragglers Wait all M can be inefficient due to stragglers Balance between waiting and losing choice A delay-based approach Bounding transfer Whenever the delay DK’ is greater than the maximum improvement, we can stop the search as the succeeding delays will increase the total time Coalescing tasks Coalesce a number of task finish events to further reduce the search space Our solution for this problem is to schedule downstream tasks at some point after K upstream tasks have completed but not wait for the stragglers in the M tasks balances the gains in balanced network transfers against the time spent waiting for stragglers. We use job sizes to figure out if the same job has been run before. Based on this we use the job history to predict the task finish times.

Using KMN KMN is built on top of Spark

Evaluation Cluster Setup Baseline Metrics 100 m2.4xlarge EC2 machines, 68GB RAM/mc Baseline Using existing schedulers with pre-selected samples Metrics % improvement = (baseline time – KMN time)/baseline time

Conviva Sampling jobs Running 4 real-world sampling queries obtained from Conviva

Machine learning workload Stochastic Gradient Descent Aggregate3 Aggregate2 Aggregate1 Gradient SGD is an iterative method that scales to large datasets and is widely used in applications such as machine translation and image classification. We run SGD on a dataset contain- ing 2 million data items, where each each item contains 4000 features. The complete dataset is around 64GB in size and each of our iterations operates on a 1% sample 1% of the data.

Machine learning workload Overall benefits KMN-M/K = 1.1 improves performance by 43% as compared to the baseline.

Facebook workload Workload trace from Facebook Hadoop cluster

Conclusion Application trends KMN Operate on subsets of data Exploit the available choices to improve performance Increasing locality balancing intermediate data transfers Reduce average job duration by 80% using just 5% additional resources

Cons Reduce the randomness of the selection Extra resources are used The utilization spikes also are not taken into account by the model and jobs which arrive during a spike do not get locality Need to modify threads Not helpful for small jobs No benefits for longer stages Not general to other applications ……

Questions Is round-robin the best heuristic that can be used here? What is the recommended value for extra upstream tasks? In evaluation part, authors tested with 0%, 5%, and 10%, but they did not discuss the optimal value for the extra portion of upstream tasks. How would the system scale with the number of stages in the computation? Since KMN expect “a few” extra jobs at each stage, would this number grows too big with the number of stages? The paper does not discuss how failures of jobs would impact the overall performance of KMN. Do they restart failed jobs, or do they simply rely on redundancy of jobs?

Thank you!

KMN Stages Time (s) Gradient 15.27 Gradient Aggregate 3 Aggregate 2

KMN Stages Time (s) Gradient 15.27 Gradient + Agg1 12.72 Aggregate 1

KMN Stages Time (s) Gradient 15.27 Gradient + Agg1 12.72 Aggregate2 Aggregate 1 Gradient KMN Stages Time (s) Gradient 15.27 Aggregate 3 Gradient + Agg1 12.72 Gradient + Agg2 11.79

KMN Stages Time (s) Gradient 15.27 Gradient + Agg1 12.72 Aggregate3 Aggregate2 Aggregate1 Gradient KMN Stages Time (s) Gradient 15.27 Gradient + Agg1 12.72 Gradient + Agg2 11.79 Gradient + Agg3 12.09