Assessing the Performance Impact of Scheduling Policies in Spark
MaryAshley Etefia, Computer Science Major, University of Georgia
Advisor: Professor Ningfang Mi
Graduate Mentor: Zhengyu Yang
NSF REU Award 1559894

Notes: This summer I worked under Professor Ningfang Mi, with the assistance of my graduate mentor, Zhengyu Yang, on a project entitled "Assessing the Performance Impact of Scheduling Policies in Spark."
Background + Keywords
- Spark is an in-memory data processing framework suited to near-real-time workloads, part of a shift away from Hadoop MapReduce (batch processing of old, stored data).
- Resilient Distributed Datasets (RDDs) are immutable, partitioned collections of data.
- An RDD dependency is the relationship between a child RDD and the parent RDD(s) it is derived from, within the stages of a Spark application.
- Scheduling policies decide when resources will be used and where they will go.
- Cloud engineers and managers at data centers are looking for ways to optimize memory resources for their customers.
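To make these terms concrete, here is a minimal Scala sketch (names like LineageDemo are placeholders, not project code): each transformation produces a new, immutable RDD that records a dependency on its parent, and toDebugString prints that lineage.

    import org.apache.spark.sql.SparkSession

    object LineageDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder
          .appName("LineageDemo")
          .master("local[*]")   // local mode, just for trying the sketch
          .getOrCreate()
        val sc = spark.sparkContext

        // Each transformation returns a new, immutable RDD that records a
        // dependency on its parent rather than modifying it in place.
        val words  = sc.parallelize(Seq("spark", "rdd", "spark", "dag"))
        val pairs  = words.map(w => (w, 1))      // narrow dependency on words
        val counts = pairs.reduceByKey(_ + _)    // wide dependency: shuffles by key

        // toDebugString prints the lineage; the indentation change marks the
        // stage boundary that the shuffle introduces.
        println(counts.toDebugString)

        spark.stop()
      }
    }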
Goals
- Identify which workloads have a wide or narrow RDD dependency.
- Determine which scheduling policy is optimal for workloads that are running simultaneously.
- The difference: under a narrow dependency each child partition depends on a single parent partition, while a wide dependency spans many parent partitions and executes more shuffling.
- The two policies compared are FAIR and FIFO (configuration sketched below):
  - FAIR: if an initial workload is running, workloads submitted in the meantime still receive resources without waiting longer.
  - FIFO: resources are handed out to jobs in the order they were submitted.
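For reference, switching between the two policies is a one-line setting; a minimal sketch (the allocation-file path is a placeholder):

    import org.apache.spark.sql.SparkSession

    // "FIFO" is Spark's default; "FAIR" lets concurrently submitted jobs
    // share resources instead of queuing behind one another.
    val spark = SparkSession.builder
      .appName("SchedulerDemo")
      .config("spark.scheduler.mode", "FAIR")   // or "FIFO"
      // Optional: a fairscheduler.xml file can define weighted pools.
      // .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
      .getOrCreate()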
Notes: This slide gives a visual representation of wide vs. narrow dependency; note the apparent influx of shuffling that occurs under a wide dependency.
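In code, that picture maps onto which transformations shuffle. A minimal sketch, assuming the SparkContext sc from the earlier snippet:

    // Narrow: each child partition depends on one parent partition; no shuffle.
    val nums    = sc.parallelize(1 to 1000, 4)   // 4 partitions
    val doubled = nums.map(_ * 2)                // narrow
    val evens   = doubled.filter(_ % 4 == 0)     // narrow

    // Wide: child partitions depend on many parent partitions, so records are
    // shuffled across the cluster; this is the influx of crossing edges the
    // DAG picture shows.
    val grouped = evens.groupBy(_ % 10)          // wide (shuffle)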
State of the Art
- RDD dependencies within the stages of a Spark application must be investigated.
- Understanding RDD dependency will help determine which scheduling policy is optimal.
- So what's new? While memory management is nothing new, the stages containing a workload's RDDs may be compromised or delayed when multiple workloads are running, so the interaction between dependency structure and scheduling policy deserves study.
Results + Conclusion
[Figures: K-Means DAG vs. WordCount DAG vs. PageRank DAG, contrasting narrow and wide RDD dependency; PageRank algorithm]
- I ran the three workloads separately so that I could investigate each one's DAG.
- PageRank's DAG likely looks the way it does because its algorithm follows a recursive definition, so a lot of looping and linking is executed.
- PageRank has a wide dependency: its DAG shows data being shuffled across multiple partitions and RDDs.
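To see why PageRank's DAG comes out wide, here is a simplified PageRank loop (an illustration, not our exact benchmark code; the tiny link table is made up): each iteration joins ranks against the link table and re-aggregates by key, and both operations shuffle data across partitions.

    // Assumes the SparkContext sc from the earlier sketches.
    val links = sc.parallelize(Seq(
      ("a", Seq("b", "c")), ("b", Seq("c")), ("c", Seq("a"))
    )).cache()
    var ranks = links.mapValues(_ => 1.0)

    for (_ <- 1 to 10) {
      // join() shuffles: a wide dependency on both links and ranks.
      val contribs = links.join(ranks).values.flatMap {
        case (neighbors, rank) => neighbors.map(n => (n, rank / neighbors.size))
      }
      // reduceByKey() shuffles again; mapValues() stays narrow.
      ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
    }
    println(ranks.collect().mkString(", "))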
Results + Conclusion (continued)
- Spark's FAIR scheduling policy is optimal for a combination of workloads such as PageRank, K-Means, and WordCount: the average bandwidth under FAIR was slightly higher than under FIFO.
- Average FAIR bandwidth: 232.07 tps; average FIFO bandwidth: 215.8 tps.
- The first graph is included for reference.
- The study of RDD dependencies serves as supporting evidence for the FIFO vs. FAIR graph.
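A sketch of how such a concurrent comparison can be driven (the pool names and the stand-in job are hypothetical): under FAIR, jobs submitted from separate threads share executors rather than queuing behind one another, and each thread can direct its jobs to its own pool.

    // Assumes spark.scheduler.mode=FAIR was set as on the Goals slide.
    val sc = spark.sparkContext
    val threads = Seq("pagerank", "kmeans", "wordcount").map { pool =>
      new Thread(() => {
        // Jobs launched from this thread land in their own fair-scheduler pool.
        sc.setLocalProperty("spark.scheduler.pool", pool)
        sc.parallelize(1 to 1000000).map(_ * 2).count()  // stand-in for the real workload
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())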
Possible Future Works
- Scaling performance results from low-level Spark workloads to the datasets handled by data centers.
- Vertical/horizontal scaling (sharding), which future researchers and institutions can work with:
  - Vertical: increasing memory in already-present servers.
  - Horizontal: increasing the number of servers.
My Contribution
- Helped set up an Apache Spark cluster with 1 master and 4 worker nodes.
- Investigated Directed Acyclic Graphs (DAGs) while running single Spark workloads.
- Configured FAIR and FIFO scheduling policies while running multiple workloads.
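For context, pointing an application at a standalone cluster like ours is a one-line setting; a minimal sketch, where the master URL is a placeholder:

    import org.apache.spark.sql.SparkSession

    // "spark://master-host:7077" stands in for the real standalone master URL.
    val spark = SparkSession.builder
      .appName("ClusterDemo")
      .master("spark://master-host:7077")
      .getOrCreate()
    println(s"Default parallelism across workers: ${spark.sparkContext.defaultParallelism}")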
What Did I Learn?
- A few tips about networking: the "client/server" model and server operating systems
- Apache Spark cluster basics
- I/O tracing tools: iostat and blktrace
- Memory management tips: hypervisors
Acknowledgements
- National Science Foundation
- Supervisor, Melanie Smith
- Professor Ningfang Mi
- Graduate Mentor, Zhengyu Yang
- Lab Partner, Henry Santer
- Graduate Student, Mahsa Bayati
- Graduate Student, Danlin Jia