UC Berkeley Spark A framework for iterative and interactive cluster computing Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica.

Presentation transcript:

UC Berkeley Spark A framework for iterative and interactive cluster computing Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica

Outline Background: Nexus project Spark goals Programming model Example jobs Implementation Interactive Spark

Nexus Background Rapid innovation in cluster computing frameworks: Dryad, Apache Hama, Pregel, Pig

Problem Rapid innovation in cluster computing frameworks No single framework optimal for all applications Want to run multiple frameworks in a single cluster »…to maximize utilization »…to share data between frameworks »…to isolate workloads

Solution Nexus is an “operating system” for the cluster over which diverse frameworks can run »Nexus multiplexes resources between frameworks »Frameworks control job execution

Nexus Architecture [Diagram: a Nexus master coordinating Nexus slaves; Hadoop v19, Hadoop v20, and MPI schedulers each submit jobs whose executor tasks run side by side on the slaves.]

Nexus Status Prototype in 7000 lines of C++ Ported frameworks: » Hadoop (900 line patch) » MPI (160 line wrapper scripts) New frameworks: » Spark, Scala framework for iterative jobs & more » Apache+haproxy, elastic web server farm (200 lines)

Outline Background: Nexus project Spark goals Programming model Example job Implementation Interactive Spark

Spark Goals Support iterative jobs »Machine learning researchers in our lab identified this as a workload that Hadoop doesn’t perform well on Experiment with programmability »Leverage Scala to integrate cleanly into programs »Support interactive use from Scala interpreter Retain MapReduce’s fine-grained fault-tolerance

Programming Model Distributed datasets »HDFS files, “parallelized” Scala collections »Can be transformed with map and filter »Can be cached across parallel operations Parallel operations »Foreach, reduce, collect Shared variables »Accumulators (add-only) »Broadcast variables (read-only)
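The add-only accumulator semantics can be pictured with a tiny plain-Scala stand-in. This is a hypothetical sketch, not Spark's actual API: tasks may only add to the cell, and only the driver reads the final value.

```scala
// Hypothetical plain-Scala stand-in for an add-only accumulator
// (illustrative only; not Spark's real Accumulator class).
class Accumulator[T](initial: T, add: (T, T) => T) {
  private var v: T = initial
  def +=(x: T): Unit = synchronized { v = add(v, x) } // tasks may only add
  def value: T = v                                    // the driver reads the total
}

val acc = new Accumulator[Int](0, _ + _)
for (x <- 1 to 10) acc += x   // simulate ten tasks each contributing
println(acc.value)            // 55
```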

Example 1: Logistic Regression

Logistic Regression Goal: find best line separating two sets of points [Figure: two classes of points (+ and -), the target separating line, and a random initial line.]

Serial Version

val data = readData(...)
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  var gradient = Vector.zeros(D)
  for (p <- data) {
    val scale = (1/(1+exp(-p.y*(w dot p.x))) - 1) * p.y
    gradient += scale * p.x
  }
  w -= gradient
}
println("Final w: " + w)
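For readers who want to run the serial loop, here is a self-contained plain-Scala sketch. It is one-dimensional for brevity; the Point class, the two-point dataset, and the fixed initial w are made-up stand-ins for the slide's readData, Vector, D, and ITERATIONS.

```scala
import scala.math.exp

// Minimal runnable sketch of the serial version (toy 1-D data).
case class Point(x: Double, y: Double)   // y is the label, +1 or -1

val ITERATIONS = 100
val data = Seq(Point(2.0, 1.0), Point(-2.0, -1.0))
var w = 0.5                              // fixed "random" init, for reproducibility

for (_ <- 1 to ITERATIONS) {
  var gradient = 0.0
  for (p <- data) {
    val scale = (1 / (1 + exp(-p.y * (w * p.x))) - 1) * p.y
    gradient += scale * p.x
  }
  w -= gradient
}
println("Final w: " + w)
```

With this data the gradient pushes w upward until the sigmoid saturates, so the final w separates the two points correctly.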

Spark Version

val data = spark.hdfsTextFile(...).map(readPoint).cache()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  var gradient = spark.accumulator(Vector.zeros(D))
  for (p <- data) {
    val scale = (1/(1+exp(-p.y*(w dot p.x))) - 1) * p.y
    gradient += scale * p.x
  }
  w -= gradient.value
}
println("Final w: " + w)

Spark Version

val data = spark.hdfsTextFile(...).map(readPoint).cache()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  var gradient = spark.accumulator(Vector.zeros(D))
  data.foreach(p => {
    val scale = (1/(1+exp(-p.y*(w dot p.x))) - 1) * p.y
    gradient += scale * p.x
  })
  w -= gradient.value
}
println("Final w: " + w)

Functional Programming Version

val data = spark.hdfsTextFile(...).map(readPoint).cache()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  w -= data.map(p => {
    val scale = (1/(1+exp(-p.y*(w dot p.x))) - 1) * p.y
    scale * p.x
  }).reduce(_+_)
}
println("Final w: " + w)
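Part of the language-integration appeal is that this functional formulation is ordinary Scala: the same map/reduce pipeline runs unchanged on a local collection. A one-dimensional sketch with made-up data:

```scala
import scala.math.exp

// The functional slide's map/reduce pattern on a plain local Seq
// (toy 1-D data; Point and the dataset are illustrative).
case class Point(x: Double, y: Double)
val data = Seq(Point(2.0, 1.0), Point(-2.0, -1.0))
var w = 0.5

for (_ <- 1 to 100) {
  // one per-point gradient via map, summed via reduce, as on the slide
  w -= data.map { p =>
    val scale = (1 / (1 + exp(-p.y * (w * p.x))) - 1) * p.y
    scale * p.x
  }.reduce(_ + _)
}
println("Final w: " + w)
```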

Job Execution [Diagram: the Spark master sends param to slaves 1-4, each holding a cached slice R1-R4 of the big dataset; the slaves' results are aggregated and used to update param for the next iteration.]

Job Execution [Diagram, Spark vs. Hadoop/Dryad: Spark reuses the cached slices R1-R4 across iterations with one aggregate/update of param per iteration, while Hadoop/Dryad launch a fresh chain of map and reduce stages and reload the data on every iteration.]

Performance Hadoop: 127 s / iteration. Spark: first iteration 174 s, further iterations 6 s.

Example 2: Alternating Least Squares

Collaborative Filtering Predict movie ratings for a set of users based on their past ratings [Figure: R, a Movies × Users matrix whose known entries are ratings (e.g. 1-5) and whose missing entries are marked ?]

Matrix Factorization Model R as product of user and movie matrices A and B of dimensions U×K and M×K: R ≈ A Bᵀ. Problem: given a subset of R, optimize A and B.

Alternating Least Squares Algorithm Start with random A and B Repeat: 1. Fixing B, optimize A to minimize error on scores in R 2. Fixing A, optimize B to minimize error on scores in R
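To make the alternation concrete, here is a rank-1 (K = 1) sketch on a tiny fully observed matrix; with a single factor per row and column, each least-squares update has the closed form a_i = Σ_j R_ij b_j / Σ_j b_j². The 2×2 matrix and iteration count are illustrative.

```scala
// Rank-1 ALS on a 2x2 matrix that is exactly rank one, so the
// reconstruction error should drop to ~0 (toy data, K = 1).
val R = Array(Array(4.0, 2.0), Array(2.0, 1.0))
var A = Array(1.0, 1.0)   // user factors
var B = Array(1.0, 1.0)   // movie factors

for (_ <- 1 to 10) {
  // Fix B, solve each a_i in closed form
  A = A.indices.map { i =>
    R(i).zip(B).map { case (r, b) => r * b }.sum / B.map(b => b * b).sum
  }.toArray
  // Fix A, solve each b_j in closed form
  B = B.indices.map { j =>
    A.indices.map(i => R(i)(j) * A(i)).sum / A.map(a => a * a).sum
  }.toArray
}

val err = (for (i <- 0 to 1; j <- 0 to 1)
           yield math.pow(R(i)(j) - A(i) * B(j), 2)).sum
println(f"Reconstruction error: $err%.6f")
```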

Serial ALS

val R = readRatingsMatrix(...)
var A = (0 until U).map(i => Vector.random(K))
var B = (0 until M).map(i => Vector.random(K))
for (i <- 1 to ITERATIONS) {
  A = (0 until U).map(i => updateUser(i, B, R))
  B = (0 until M).map(i => updateMovie(i, A, R))
}

Naïve Spark ALS

val R = readRatingsMatrix(...)
var A = (0 until U).map(i => Vector.random(K))
var B = (0 until M).map(i => Vector.random(K))
for (i <- 1 to ITERATIONS) {
  A = spark.parallelize(0 until U, numSlices)
           .map(i => updateUser(i, B, R)).collect()
  B = spark.parallelize(0 until M, numSlices)
           .map(i => updateMovie(i, A, R)).collect()
}

Problem: R re-sent to all nodes in each parallel operation

Efficient Spark ALS

val R = spark.broadcast(readRatingsMatrix(...))
var A = (0 until U).map(i => Vector.random(K))
var B = (0 until M).map(i => Vector.random(K))
for (i <- 1 to ITERATIONS) {
  A = spark.parallelize(0 until U, numSlices)
           .map(i => updateUser(i, B, R.value)).collect()
  B = spark.parallelize(0 until M, numSlices)
           .map(i => updateMovie(i, A, R.value)).collect()
}

Solution: mark R as a broadcast variable
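The saving can be pictured with a toy count of how many times the big value crosses the wire. This is plain Scala, not Spark, and the node/task counts are made up: per-task capture ships R once per task, broadcast ships it once per node.

```scala
import scala.collection.mutable

// Toy model: count how often a large value is "shipped" to the cluster.
var shipped = 0
def ship[T](v: T): T = { shipped += 1; v }   // stand-in for serializing R

val R = Array.fill(1000)(1.0)                // the big ratings matrix
val nodes = 4
val tasksPerNode = 25

// Naive: every task's closure captures R, so R is re-sent per task
shipped = 0
for (_ <- 1 to nodes * tasksPerNode) ship(R)
val naiveShips = shipped                     // 100

// Broadcast: R is sent once per node and cached for later tasks
shipped = 0
val nodeCache = mutable.Map[Int, Array[Double]]()
for (n <- 0 until nodes; _ <- 1 to tasksPerNode)
  nodeCache.getOrElseUpdate(n, ship(R))
val broadcastShips = shipped                 // 4

println(s"naive: $naiveShips ships, broadcast: $broadcastShips ships")
```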

ALS Performance

Subsequent Iteration Breakdown 36% of each iteration spent on broadcast

Outline Background: Nexus project Spark goals Programming model Example job Implementation Interactive Spark

Architecture Driver program connects to Nexus and schedules tasks Workers run tasks, report results and variable updates Data shared with HDFS/NFS No communication between workers for now [Diagram: the driver ships user code and broadcast vars to workers through Nexus and receives tasks' results; workers read shared data from HDFS via a local cache.]

Distributed Datasets Each distributed dataset object maintains a lineage that is used to rebuild slices that are lost / fall out of cache Ex:

errors = textFile("log").filter(_.contains("error"))
                        .map(_.split('\t')(1))
                        .cache()

[Diagram: lineage chain HdfsFile (path: hdfs://…) -> FilteredFile (func: contains(...)) -> MappedFile (func: split(…)) -> CachedFile; getIterator(slice) is served from the local cache or recomputed from HDFS.]
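A toy model of the lineage mechanism (illustrative only, not Spark's internals): each dataset records its parent and transformation, so an evicted cached result can be rebuilt by replaying the chain.

```scala
// Toy lineage chain: Source -> Filtered -> Mapped -> Cached.
// Evicting the cache and recomputing reproduces the same result.
sealed trait Dataset[T] { def compute(): Seq[T] }
case class Source[T](load: () => Seq[T]) extends Dataset[T] {
  def compute(): Seq[T] = load()
}
case class Filtered[T](parent: Dataset[T], pred: T => Boolean) extends Dataset[T] {
  def compute(): Seq[T] = parent.compute().filter(pred)
}
case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
  def compute(): Seq[B] = parent.compute().map(f)
}
class Cached[T](parent: Dataset[T]) extends Dataset[T] {
  private var cache: Option[Seq[T]] = None
  def compute(): Seq[T] =
    cache.getOrElse { val r = parent.compute(); cache = Some(r); r }
  def evict(): Unit = cache = None        // simulate losing the cached slice
}

val log = Source(() => Seq("error\tdisk", "info\tok", "error\tnet"))
val errors = new Cached(Mapped(Filtered(log, (s: String) => s.contains("error")),
                               (s: String) => s.split('\t')(1)))
println(errors.compute())   // the filtered, mapped field values
errors.evict()
println(errors.compute())   // rebuilt from lineage: same values
```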

Language Integration Scala closures are Serializable objects »Serialize on driver, load & run on workers Not quite enough »Nested closures may reference entire outer scope »May pull in non-Serializable variables not used inside »Solution: bytecode analysis + reflection Shared variables »Accumulators: serialized form contains ID »Broadcast vars: serialized form is path to HDFS file

Interactive Spark Modified Scala interpreter to allow Spark to be used interactively from the command line Required two changes: »Modified wrapper code generation so that each “line” typed has references to objects for its dependencies »Place generated classes in distributed filesystem Enables in-memory exploration of big data

Demo

Conclusions Spark provides two abstractions that enable iterative jobs and interactive use: 1. Distributed datasets with controllable persistence, supporting fault-tolerant parallel operations 2. Shared variables for efficient broadcast and imperative-style programming Language integration achieved using Scala features + some amount of hacking All this is surprisingly little code (~1600 lines)

Related Work DryadLINQ »SQL-like queries integrated in C# programs »Build queries through operations on lazy datasets »Cannot have a dataset persist across queries »No concept of shared variables for broadcast etc Pig & Hive »Query languages that can call into Java/Python/etc UDFs »No support for caching a dataset across queries OpenMP »Compiler extension for parallel loops in C++ »Annotate variables as read-only or accumulator above loop »Cluster version exists, but not fault-tolerant

Future Work Open-source Spark and Nexus »Probably this summer »Very interested in getting users! Understand which classes of algorithms we can handle and how to extend Spark for others Build higher-level interfaces on top of interactive Spark (e.g. R, SQL)

Questions?