Spark: In-Memory Cluster Computing for Iterative and Interactive Applications
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica
UC Berkeley
Motivation
Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.
[diagram: input is read from stable storage, flows through map and reduce tasks, and the output is written back to stable storage]
Motivation
Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.
Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures. The same applies to Dryad, SQL engines, etc.
Motivation
Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
Iterative algorithms (machine learning, graphs)
Interactive data mining tools (R, Excel, Python)
With current frameworks, apps reload data from stable storage on each query. Both types of apps are quite common in data analytics.
Example: Iterative Apps
[diagram: the input is re-read from stable storage by each of iterations 1, 2, 3, ..., each producing its own result; each iteration is, for example, a MapReduce job]
Goal: Keep Working Set in RAM
[diagram: one-time processing loads the input into distributed memory, and iterations 1, 2, 3, ... reuse the working set from there]
Challenge
The distributed memory abstraction must be fault-tolerant and efficient in large commodity clusters.
How do we design a programming interface that can provide fault tolerance efficiently?
Challenge
Existing distributed storage abstractions offer an interface based on fine-grained updates: reads and writes to cells in a table (e.g. key-value stores, databases, distributed shared memory).
This requires replicating data or update logs across nodes for fault tolerance, which is expensive for data-intensive apps.
Solution: Resilient Distributed Datasets (RDDs)
Offer an interface based on coarse-grained transformations (e.g. map, group-by, join).
This allows for efficient fault recovery using lineage:
Log one operation to apply to many elements
Recompute lost partitions of a dataset on failure
No cost if nothing fails
RDD Recovery
[diagram: same setup as the previous slide, with the working set held in distributed memory; a lost partition is recomputed from the input rather than restored from a replica]
Generality of RDDs
Despite their coarse-grained interface, RDDs can express surprisingly many parallel algorithms, since these naturally apply the same operation to many items.
They unify many current programming models:
Data flow models: MapReduce, Dryad, SQL, ...
Specialized tools for iterative apps: BSP (Pregel), iterative MapReduce (Twister), incremental processing (CBP)
They also support new apps that these models don't.
Key idea: add "variables" to the "functions" in functional programming.
Outline Programming model Applications Implementation Demo Current work
Spark Programming Interface
Language-integrated API in Scala; can be used interactively from the Scala interpreter.
Provides:
Resilient distributed datasets (RDDs)
Transformations to build RDDs (map, join, filter, ...)
Actions on RDDs that return a result to the program or write data to storage (reduce, count, save, ...)
You write a single program, similar to DryadLINQ. Distributed datasets with parallel operations on them are fairly standard; the new thing is that they can be reused across operations. Variables in the driver program can be used in parallel operations, and the whole model is designed to be easy to distribute in a fault-tolerant fashion.
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns.

  lines = spark.textFile("hdfs://...")          // base RDD
  errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD
  messages = errors.map(_.split('\t')(2))
  messages.persist()

  messages.filter(_.contains("foo")).count      // action
  messages.filter(_.contains("bar")).count

[diagram: the driver sends tasks to workers, which cache message partitions built from HDFS blocks and return results]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data).
RDD Fault Tolerance
RDDs maintain lineage information that can be used to reconstruct lost partitions. For example:

  messages = textFile(...).filter(_.startsWith("ERROR"))
                          .map(_.split('\t')(2))

[lineage graph: HDFSFile --filter(func = _.startsWith(...))--> FilteredRDD --map(func = _.split(...))--> MappedRDD]
Example: Logistic Regression Goal: find best line separating two sets of points random initial line + + + + + + – + + – Note that dataset is reused on each gradient computation – – + + – – – – – – target
Logistic Regression Code

  val data = spark.textFile(...).map(readPoint).persist()

  var w = Vector.random(D)

  for (i <- 1 to ITERATIONS) {
    val gradient = data.map(p =>
      (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
    ).reduce((a, b) => a + b)
    w -= gradient
  }

  println("Final w: " + w)
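The snippet above assumes a readPoint helper and a labeled-point type that the slide does not show. A minimal sketch of what they might look like; the field layout and input format are assumptions, and Vector is the same vector type the slide's code relies on (with dot, +, scalar *, and Vector.random):

  // Hypothetical supporting definitions for the logistic regression example.
  // Assumed input format: label first, then D feature values per line.
  case class Point(x: Vector, y: Double)   // feature vector and a +1/-1 label

  def readPoint(line: String): Point = {
    val tok = line.trim.split(" ").map(_.toDouble)
    Point(new Vector(tok.drop(1)), tok(0))
  }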
Logistic Regression Performance
Hadoop: 127 s / iteration.
Spark: 174 s for the first iteration, 6 s for further iterations.
This is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each).
RDDs in More Detail RDDs additionally provide: Control over partitioning (e.g. hash-based), which can be used to optimize data placement across queries Control over persistence (e.g. store on disk vs in RAM) Fine-grained reads (treat RDD as a big table)
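As a rough sketch of what these controls look like in the Scala API (method and package names here follow later public Spark releases and are an approximation of the prototype described in this talk; `spark` is the context object used in the other examples):

  import org.apache.spark.HashPartitioner
  import org.apache.spark.storage.StorageLevel

  // Hash-partition a key-value RDD so that records with the same key land on
  // the same node for every query that reuses it.
  val pairs = spark.textFile("hdfs://...").map(line => (line.split('\t')(0), line))
  val partitioned = pairs.partitionBy(new HashPartitioner(100))

  // Control persistence: keep the data in RAM, spilling to disk if needed.
  partitioned.persist(StorageLevel.MEMORY_AND_DISK)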
RDDs vs Distributed Shared Memory

Concern              | RDDs                                        | Distr. Shared Mem.
Reads                | Fine-grained                                | Fine-grained
Writes               | Bulk transformations                        | Fine-grained
Consistency          | Trivial (immutable)                         | Up to app / runtime
Fault recovery       | Fine-grained and low-overhead using lineage | Requires checkpoints and program rollback
Straggler mitigation | Possible using speculative execution        | Difficult
Work placement       | Automatic based on data locality            | Up to app (but runtime aims for transparency)
Spark Applications EM alg. for traffic prediction (Mobile Millennium) In-memory OLAP & anomaly detection (Conviva) Twitter spam classification (Monarch) Alternating least squares matrix factorization Pregel on Spark (Bagel) SQL on Spark (Shark)
Mobile Millennium App Estimate city traffic using position observations from “probe vehicles” (e.g. GPS-carrying taxis)
Sample Data
One day's worth of SF taxi measurements.
Credit: Tim Hunter, with the support of the Mobile Millennium team; P.I. Alex Bayen (traffic.berkeley.edu)
Implementation Expectation maximization (EM) algorithm using iterated map and groupByKey on the same data 3x faster with in-memory RDDs
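A hedged sketch of that iteration pattern; the observation type, the parsing, and the E/M-step functions below are placeholders, not the Mobile Millennium code:

  // Pattern only: iterated map + groupByKey over the same cached RDD.
  // `parseObservation`, `expectedState`, `reestimate`, `initialParams`,
  // and `ITERATIONS` are hypothetical placeholders.
  val observations = spark.textFile("hdfs://...").map(parseObservation).persist()

  var params = initialParams
  for (i <- 1 to ITERATIONS) {
    val expectations = observations.map(obs => (obs.linkId, expectedState(obs, params)))  // E-step
    val byLink = expectations.groupByKey()                        // group per road link
    params = reestimate(byLink.collect(), params)                 // M-step on the driver
  }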
Conviva GeoReport
Aggregations on many keys with the same WHERE clause.
[chart: query time in hours]
The ~40x gain comes from:
Not re-reading unused columns or filtered records
Avoiding repeated decompression
In-memory storage of deserialized objects
Implementation
Spark runs on the Mesos cluster manager [NSDI '11], letting it share resources with Hadoop and other apps.
It can read from any Hadoop input source (HDFS, S3, ...).
[diagram: Spark, Hadoop, and MPI frameworks running side by side on Mesos across cluster nodes]
~10,000 lines of code; no changes to Scala.
RDD Representation Simple common interface: Set of partitions Preferred locations for each partition List of parent RDDs Function to compute a partition given parents Optional partitioning info Allows capturing wide range of transformations Users can easily add new transformations
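In Scala, this common interface can be sketched as a small trait; this is an illustration of the five pieces listed above, not the actual Spark source, and Partition and Partitioner stand in for the real handle types:

  // Simplified sketch of the common RDD interface.
  trait Partition
  trait Partitioner

  trait RDDSketch[T] {
    def partitions: Seq[Partition]                      // set of partitions
    def preferredLocations(p: Partition): Seq[String]   // e.g. HDFS block locations
    def dependencies: Seq[RDDSketch[_]]                 // list of parent RDDs
    def compute(p: Partition): Iterator[T]              // build a partition from parents
    def partitioner: Option[Partitioner]                // optional partitioning info
  }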
Scheduler
Dryad-like task DAG
Pipelines functions within a stage
Cache-aware for data reuse and locality
Partitioning-aware to avoid shuffles
[diagram: RDDs A-G connected by map, union, groupBy, and join, grouped into stages 1-3; shaded partitions are cached]
The scheduler is NOT a modified version of Hadoop.
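For intuition, narrow operations such as map and filter are pipelined within one stage, while operations like groupByKey or join introduce a stage boundary with a shuffle unless the inputs are already suitably partitioned. A small illustrative program (the dataset and key choice are made up):

  // map and filter run back to back in a single stage, record by record.
  val clicks = spark.textFile("hdfs://.../clicks")
    .map(line => (line.split('\t')(0), 1))
    .filter(pair => pair._1.nonEmpty)

  // groupByKey needs all values for a key together, so the scheduler ends the
  // stage here and shuffles -- unless `clicks` was already hash-partitioned
  // and cached, in which case the partitioning-aware scheduler can skip it.
  val perUser = clicks.groupByKey()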
Language Integration
Scala closures are Serializable Java objects: serialize them on the driver, then load and run them on workers.
This is not quite enough, though:
Nested closures may reference the entire outer scope
They may pull in non-Serializable variables that are not used inside the closure
Solution: bytecode analysis + reflection
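An example of the problem being solved here: a closure defined inside a class can capture the whole enclosing object, and with it fields it never uses. A minimal sketch (class and field names are made up; the RDD package name follows later Spark releases):

  import org.apache.spark.rdd.RDD

  class LogAnalyzer {
    val lock = new Object            // not Serializable
    val keyword = "ERROR"

    def countMatches(lines: RDD[String]): Long =
      // The closure only needs `keyword`, but because `keyword` is a field,
      // naively serializing the closure would drag in `this` (and `lock`).
      // Spark's bytecode analysis + reflection trims such unused references.
      lines.filter(line => line.contains(keyword)).count()
  }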
Interactive Spark
Modified the Scala interpreter to allow Spark to be used interactively from the command line:
Altered code generation so that each "line" typed has references to the objects it depends on
Added a facility to ship generated classes to workers
This enables in-memory exploration of big data.
Outline Spark programming model Applications Implementation Demo Current work
Spark Debugger Debugging general distributed apps is very hard Idea: leverage the structure of computations in Spark, MapReduce, Pregel, and other systems These split jobs into independent, deterministic tasks for fault tolerance; use this for debugging: Log lineage for all RDDs created (small) Let user replay any task in jdb, or rebuild any RDD
Conclusion
Spark's RDDs offer a simple and efficient programming model for a broad range of apps.
They achieve fault tolerance efficiently by providing coarse-grained operations and tracking lineage.
They can express many current programming models.
www.spark-project.org
Related Work
DryadLINQ, FlumeJava: similar "distributed dataset" API, but cannot reuse datasets efficiently across queries.
Relational databases: lineage/provenance, logical logging, materialized views.
GraphLab, Piccolo, BigTable, RAMCloud: fine-grained writes, similar to distributed shared memory.
Iterative MapReduce (e.g. Twister, HaLoop): cannot define multiple distributed datasets, run different map/reduce pairs on them, or query data interactively.
Spark Operations
Transformations (define a new RDD): map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues
Actions (return a result to the driver program): collect, reduce, count, save, lookupKey
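Several of these operations compose naturally; for example, the usual word count uses flatMap, map, and reduceByKey as transformations and collect as the action (a standard illustration, not taken from the talk):

  // Word count: transformations build the RDD lazily, collect() runs the job.
  val counts = spark.textFile("hdfs://...")
    .flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey((a, b) => a + b)
  counts.collect().foreach(println)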
Fault Recovery Results (shown for k-means)
Behavior with Not Enough RAM
PageRank Results
Pregel
Graph processing system based on the BSP model.
Vertices in the graph have state.
At each superstep, each vertex can update its state and send messages to vertices for the next superstep.
Pregel Using RDDs
[diagram: the input graph is mapped to vertex states_0 and messages_0; each superstep groups messages by vertex ID and maps to produce vertex states_i and messages_i for the next step]
Pregel Using RDDs

  verts = // RDD of (ID, State) pairs
  msgs  = // RDD of (ID, Message) pairs

  newData = verts.cogroup(msgs).mapValues(
    (id, vert, msgs) => userFunc(id, vert, msgs)
    // gives (id, newState, outgoingMsgs)
  ).persist()

  newVerts = newData.mapValues((v, ms) => v)
  newMsgs  = newData.flatMap((id, (v, ms)) => ms)

[diagram: same superstep structure as the previous slide]
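Driving the supersteps is then just a loop over these definitions. A rough sketch, where initialVertices, initialMessages, MAX_SUPERSTEPS, and the superstep helper (wrapping the cogroup/mapValues step above) are assumptions rather than the Bagel source, and outgoing messages are assumed to already be keyed by destination ID:

  // Hypothetical driver loop around the superstep computation shown above.
  var verts = initialVertices        // RDD of (ID, State) pairs
  var msgs  = initialMessages        // RDD of (ID, Message) pairs
  var step  = 0
  while (step < MAX_SUPERSTEPS && msgs.count() > 0) {
    val newData = superstep(verts, msgs)                 // cogroup + mapValues + persist
    verts = newData.mapValues { case (v, _) => v }
    msgs  = newData.flatMap { case (_, (_, out)) => out }
    step += 1
  }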
Placement Optimizations
Partition the vertex RDDs in the same way across steps, so that vertex states never need to be sent across nodes.
Use a custom partitioner to minimize traffic (e.g. group pages by domain).
This is easily expressible with RDD partitioning.
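A hedged sketch of such a custom partitioner; the URL parsing and the partition count are illustrative, and the Partitioner package name follows later Spark releases:

  import org.apache.spark.Partitioner

  // Illustrative partitioner that co-locates pages from the same domain so
  // that most link traffic stays on one node.
  class DomainPartitioner(parts: Int) extends Partitioner {
    def numPartitions: Int = parts
    def getPartition(key: Any): Int = {
      val domain = key.toString.stripPrefix("http://").takeWhile(_ != '/')
      val h = domain.hashCode % parts
      if (h < 0) h + parts else h      // keep the index non-negative
    }
  }

  // e.g. links.partitionBy(new DomainPartitioner(64)).persist()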