Making Sense of Spark Performance

Presentation transcript:

Making Sense of Spark Performance (eecs.berkeley.edu/~keo/traces). Kay Ousterhout, UC Berkeley. In collaboration with Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, and Byung-Gon Chun.

About Me: PhD student in Computer Science at UC Berkeley. Thesis work centers on the performance of large-scale distributed systems. Spark PMC member.

About This Talk: an overview of how Spark works, how we measured performance bottlenecks, an in-depth performance analysis for a few workloads, and a demo of the performance analysis tool. This is a research project, but the talk will be more practically focused.

Count the # of words in the document. Example: the document ("I am Sam / Sam I am / Do you like / Green eggs and ham? / ... / Thank you, Sam I am") is split across a cluster of machines. Each Spark (or Hadoop/Dryad/etc.) task counts the words in its piece (6, 6, 4, and 5 here), and the Spark driver sums the results: 6 + 6 + 4 + 5 = 21. First, a quick refresher on how these frameworks work; current execution engines are architected in a similar way. A job is some analysis on a large amount of data. The basic idea: split the job into increments of work called tasks, where each task runs on one machine. To give you a sense of what one of these tasks might do, look at this simple example.
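
For readers who want to make the refresher concrete, here is a minimal Scala sketch of this job against the Spark RDD API (the same idea applies to Hadoop or Dryad); the application name and input path are placeholders, not something from the talk.

    import org.apache.spark.{SparkConf, SparkContext}

    object CountWords {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("count-words"))

        // Each task counts the words in its partition of the input file;
        // count() sums the per-partition counts at the driver (6 + 6 + 4 + 5 = 21 above).
        val totalWords = sc.textFile("hdfs:///data/green-eggs-and-ham.txt")  // placeholder path
          .flatMap(line => line.split("\\s+"))
          .filter(word => word.nonEmpty)
          .count()

        println(s"Total words: $totalWords")
        sc.stop()
      }
    }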

Count the # of occurrences of each word. MAP: each task turns its block of the document ("I am Sam", "Sam I am / Do you like", "Green eggs and ham? ...", "Thank you, Sam I am") into per-word counts ({I: 2, am: 2, ...}, {Sam: 1, I: 1, ...}, {Green: 1, eggs: 1, ...}, {Thank: 1, eggs: 1, ...}). REDUCE: the per-word counts are shuffled so that each reduce task can sum the totals for its words ({I: 4, you: 2, ...}, {am: 4, Green: 1, ...}, {Sam: 4, ...}, {Thank: 1, you: 1, ...}).
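
The per-word count from this slide looks almost the same in code; a sketch reusing the SparkContext `sc` and placeholder path from the previous example, where map emits (word, 1) pairs and reduceByKey plays the role of the reduce phase.

    // Map phase: turn each block of text into (word, 1) pairs.
    // Reduce phase: the shuffle brings all pairs for a given word to one machine,
    // where reduceByKey sums them, e.g. {I: 4, you: 2, ...}.
    val wordCounts = sc.textFile("hdfs:///data/green-eggs-and-ham.txt")  // placeholder path
      .flatMap(line => line.split("\\s+"))
      .filter(word => word.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    wordCounts.take(10).foreach { case (word, count) => println(s"$word: $count") }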

Performance considerations: caching input data; scheduling (assigning tasks to machines); straggler tasks; network performance (e.g., during the shuffle).

TONS of work on improving performance:
Caching: PACMan [NSDI ’12], Spark [NSDI ’12], Tachyon [SoCC ’14]
Scheduling: Sparrow [SOSP ’13], Apollo [OSDI ’14], Mesos [NSDI ’11], DRF [NSDI ’11], Tetris [SIGCOMM ’14], Omega [EuroSys ’13], YARN [SoCC ’13], Quincy [SOSP ’09], KMN [OSDI ’14]
Stragglers: Scarlett [EuroSys ’11], SkewTune [SIGMOD ’12], LATE [OSDI ’08], Mantri [OSDI ’10], Dolly [NSDI ’13], GRASS [NSDI ’14], Wrangler [SoCC ’14]
Network: VL2 [SIGCOMM ’09], Hedera [NSDI ’10], Sinbad [SIGCOMM ’13], Orchestra [SIGCOMM ’11], Baraat [SIGCOMM ’14], Varys [SIGCOMM ’14], PeriSCOPE [OSDI ’12], SUDO [NSDI ’12], Camdoop [NSDI ’12], Oktopus [SIGCOMM ’11], EyeQ [NSDI ’12], FairCloud [SIGCOMM ’12]
Generalized programming model: Dryad [EuroSys ’07], Spark [NSDI ’12]
Lots of valuable work here, and many papers cite impressive improvements. But what are the biggest problems? What do we do next? How do we go about this in a principled way, rather than playing whack-a-mole forever? Dogma has emerged from the optimizations people have illustrated, rather than from any kind of comprehensive performance study.

The resulting dogma: network and disk I/O are bottlenecks, and stragglers are a major issue with unknown causes.

This Work: (1) a methodology for quantifying performance bottlenecks; (2) bottleneck measurements for 3 SQL workloads (TPC-DS and 2 others).

Findings: network optimizations can reduce job completion time by at most 2%; CPU (not I/O) is often the bottleneck; most straggler causes can be identified and fixed. Very different from the existing mentality! Why so few workloads?

Example Spark task: network read, compute, disk write, repeated over time for each record (each small slice is the time to handle one record). Performance is very tricky to understand here! The problem: traces have only a coarse-grained breakdown of how tasks spend their time, typically just how long the entire task took. Fine-grained instrumentation is needed to understand performance.

How much faster would a job run if the network were infinitely fast? What’s an upper bound on the improvement from network optimizations?

How much faster could a task run if the network were infinitely fast? Look at a single task (the part of the job that runs on just one machine), whose original runtime interleaves network reads, compute, and disk writes. We added instrumentation to measure when the compute thread is blocked on the network and when it is blocked on disk. The task runtime with an infinitely fast network is then the original runtime minus the time blocked on the network. Crucial point: if the disk read is pipelined with computation, there is no point in speeding it up!
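
A sketch of the per-task arithmetic in Scala. The record type and field names are hypothetical stand-ins for the instrumentation described on the slide, not Spark's actual metric names.

    // Per-task timing collected by the added instrumentation (names are illustrative).
    case class TaskTiming(totalMs: Long, blockedOnNetworkMs: Long, blockedOnDiskMs: Long)

    // Upper bound: with an infinitely fast network, the task saves at most the time
    // its compute thread spent blocked waiting on the network. Network I/O that was
    // pipelined behind computation is not counted, because speeding it up wouldn't help.
    def runtimeWithInfinitelyFastNetwork(t: TaskTiming): Long =
      t.totalMs - t.blockedOnNetworkMs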

How much faster would a job run if the network were infinitely fast? Picture the job's tasks (Task 0, Task 1, Task 2) laid out over time on 2 slots: t_o is the original job completion time. Now strip out each task's time blocked on the network and replay the tasks on the same 2 slots: t_n is the job completion time with an infinitely fast network.
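
A sketch of the job-level estimate, reusing the hypothetical TaskTiming record and per-task bound above: replay the tasks onto a fixed number of slots, once with measured runtimes (t_o) and once with network-blocked time removed (t_n). The greedy replay and the example runtimes are illustrative; the real analysis is more careful about scheduling.

    // Greedy replay: each task goes to the slot that frees up earliest;
    // job completion time is when the last slot finishes.
    def simulatedCompletionTimeMs(taskRuntimesMs: Seq[Long], slots: Int): Long = {
      val slotFreeAt = Array.fill(slots)(0L)
      for (runtime <- taskRuntimesMs) {
        val earliest = slotFreeAt.indexOf(slotFreeAt.min)
        slotFreeAt(earliest) += runtime
      }
      slotFreeAt.max
    }

    val tasks = Seq(
      TaskTiming(totalMs = 800, blockedOnNetworkMs = 150, blockedOnDiskMs = 60),
      TaskTiming(totalMs = 650, blockedOnNetworkMs = 40, blockedOnDiskMs = 0),
      TaskTiming(totalMs = 900, blockedOnNetworkMs = 200, blockedOnDiskMs = 90))

    val tOriginal = simulatedCompletionTimeMs(tasks.map(_.totalMs), slots = 2)
    val tFastNetwork = simulatedCompletionTimeMs(tasks.map(t => runtimeWithInfinitelyFastNetwork(t)), slots = 2)
    println(f"Upper bound on improvement from a faster network: ${100.0 * (tOriginal - tFastNetwork) / tOriginal}%.1f%%")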

SQL Workloads: TPC-DS (20 machines, 850 GB; 60 machines, 2.5 TB; www.tpc.org/tpcds), the Big Data Benchmark (5 machines, 60 GB; amplab.cs.berkeley.edu/benchmark), and a Databricks workload (9 machines, tens of GB; databricks.com). Two versions of each: in-memory and on-disk. Where possible, results were validated against larger workloads from Google / Microsoft / Facebook, though that was only possible at a coarser granularity.

How much faster could jobs get from optimizing network performance? (The plot shows the 5th, 25th, 50th, 75th, and 95th percentiles; the 3 workloads are on the x-axis, with one on-disk and one in-memory version of each; the cluster uses a 1 Gbps network link, nothing insane.) Median improvement: at most 2%.

How can we sanity-check these numbers? How do our workloads compare to larger-scale traces? As mentioned, the traces don't contain enough data to measure this directly.

How much data is transferred per CPU second? Compare to Facebook, where we have per-job information: the numbers there are generally much lower. For Microsoft and Google we only have aggregate information (total bytes written and total machine or task seconds, not CPU seconds, so the figures would be slightly higher using the CPU metric): Microsoft ’09-’10: 1.9–6.35 Mb / task second; Google ’04-’07: 1.34–1.61 Mb / machine second. This says nothing about tail behavior, but still: if anything, those workloads are much less network-bound than the ones we looked at.

Shuffle Data < Input Data. How can this be true? It's actually fairly intuitive: these jobs are reducing data, so except for ETL-style jobs the shuffled data is often much smaller than the input. A sort workload is not realistic, and neither is a workload that only does a shuffle.

What kind of hardware should I buy? 10Gbps networking hardware likely not necessary!

How much faster would jobs complete if the disk were infinitely fast?

How much faster could jobs get from optimizing disk performance? This measurement includes all disk and network time (it is impossible to fully separate the two), so it caps the improvement from flash, etc. Median improvement: at most 19%.
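
For completeness, one plausible reading of this slide as code, again using the hypothetical TaskTiming record from the network sketch: since disk-blocked time cannot be cleanly separated from network-blocked time, the disk bound removes both.

    // Upper bound for disk optimizations: remove the time the compute thread was
    // blocked on disk. Because disk and network time cannot be separated here,
    // this also removes network-blocked time, so it overstates the disk-only gain.
    def runtimeWithInfinitelyFastDisk(t: TaskTiming): Long =
      t.totalMs - t.blockedOnDiskMs - t.blockedOnNetworkMs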

Disk Configuration. Our instances: 2 disks, 8 cores. Cloudera recommends at least 1 disk for every 3 cores, and as many as 2 disks per core. Our instances are under-provisioned on disk, so our results are an upper bound.

How much data is transferred per CPU second? Based on these aggregate metrics, BDB and TPC-DS are, if anything, more I/O-bound than the other workloads. Google: 0.8–1.5 MB / machine second; Microsoft: 7–11 MB / task second.

What does this mean about Spark versus Hadoop? This work compares serialized + compressed in-memory data against serialized + compressed on-disk data, and finds at most a 19% improvement from the faster storage.

This work says nothing about Spark vs. Hadoop! The 19% figure compares serialized + compressed in-memory data against serialized + compressed on-disk data. Working with deserialized in-memory data is much faster still: 6x or more (amplab.cs.berkeley.edu/benchmark/) and up to 10x over on-disk data (spark.apache.org).

What causes stragglers? Takeaway: the causes depend on the workload, but disk and garbage collection are common. Fixing straggler causes can speed up other tasks too.

Live demo

eecs.berkeley.edu/~keo/traces

I want your workloads! Set spark.eventLog.enabled to true and send the resulting event logs to keo@cs.berkeley.edu.
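
Event logging can be turned on when the SparkContext is constructed; a minimal sketch, in which the application name and log directory are placeholders (spark.eventLog.enabled is the setting named on the slide).

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("my-workload")                        // placeholder name
      .set("spark.eventLog.enabled", "true")            // setting named on the slide
      .set("spark.eventLog.dir", "hdfs:///spark-logs")  // placeholder directory
    val sc = new SparkContext(conf)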

Summary: network optimizations can reduce job completion time by at most 2%; CPU (not I/O) is often the bottleneck; optimizing disk yields at most a 19% reduction in completion time; many straggler causes can be identified and fixed. Project webpage (with links to the paper and tool): eecs.berkeley.edu/~keo/traces. Contact: keo@cs.berkeley.edu, @kayousterhout. These conclusions are pretty different from what RR said.

Backup Slides

How do results change with scale?

How does the utilization compare?