Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica Spark Fast, Interactive, Language-Integrated Cluster Computing UC BERKELEY

Project Goals
Extend the MapReduce model to better support two common classes of analytics apps:
»Iterative algorithms (machine learning, graphs)
»Interactive data mining
Enhance programmability:
»Integrate into Scala programming language
»Allow interactive use from Scala interpreter

Motivation
Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.
[Diagram: Input → Map → Reduce → Output]
Benefits of data flow: runtime can decide where to run tasks and can automatically recover from failures.

Motivation
Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
»Iterative algorithms (machine learning, graphs)
»Interactive data mining tools (R, Excel, Python)
With current frameworks, apps reload data from stable storage on each query.

Solution: Resilient Distributed Datasets (RDDs)
Allow apps to keep working sets in memory for efficient reuse
Retain the attractive properties of MapReduce:
»Fault tolerance, data locality, scalability
Support a wide range of applications

Outline
»Spark programming model
»Implementation
»Demo
»User applications

Programming Model
Resilient distributed datasets (RDDs):
»Immutable, partitioned collections of objects
»Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage
»Can be cached for efficient reuse
Actions on RDDs:
»count, reduce, collect, save, …
(A minimal sketch of this split follows below.)
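[Editor's note] To make the transformation/action split concrete, here is a minimal word-count sketch in the deck's Scala style. It is not from the slides: the SparkContext variable sc, the app name, and the word-count logic are illustrative assumptions; the hdfs:// paths are placeholders.

import org.apache.spark.{SparkConf, SparkContext}

object RDDBasics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-basics"))

    // Transformations are lazy: each one only defines a new RDD.
    val lines  = sc.textFile("hdfs://...")                   // base RDD from stable storage
    val words  = lines.flatMap(_.split(" "))                 // transformation
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)   // transformation

    counts.cache()                                           // mark for in-memory reuse

    // Actions force evaluation and return a result to the driver.
    println(counts.count())                                  // action: number of distinct words
    counts.saveAsTextFile("hdfs://...")                      // action: write results out
    sc.stop()
  }
}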

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")              // base RDD
errors = lines.filter(_.startsWith("ERROR"))      // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count        // action
cachedMsgs.filter(_.contains("bar")).count
...

[Diagram: the driver sends tasks to workers; each worker builds its partition of messages from an HDFS block (Block 1-3), keeps it in memory (Cache 1-3), and returns results]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

RDD Fault Tolerance
RDDs maintain lineage information that can be used to reconstruct lost partitions.

Ex:
messages = textFile(...)
             .filter(_.startsWith("ERROR"))
             .map(_.split('\t')(2))

[Lineage diagram: HDFS File → filter(func = _.startsWith(...)) → Filtered RDD → map(func = _.split(...)) → Mapped RDD]
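[Editor's note] A small sketch (an assumption, not from the slides) of how you can inspect the lineage Spark would replay, using the real RDD.toDebugString method:

import org.apache.spark.{SparkConf, SparkContext}

object LineageDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lineage-demo"))

    val messages = sc.textFile("hdfs://...")
      .filter(_.startsWith("ERROR"))
      .map(_.split('\t')(2))

    // Prints the recursive dependency chain, roughly:
    //   MapPartitionsRDD at map <- MapPartitionsRDD at filter <- HadoopRDD at textFile
    // If a cached partition is lost, Spark reapplies exactly this chain to the
    // corresponding input block instead of replicating or checkpointing the data.
    println(messages.toDebugString)
    sc.stop()
  }
}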

Example: Logistic Regression
Goal: find best line separating two sets of points.
[Plot: two point classes (+ and –), a random initial line, and the target line the algorithm converges to]

Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()  // points stay in memory

var w = Vector.random(D)                               // random initial parameter vector

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
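[Editor's note] The slide's code relies on definitions it does not show (readPoint, Vector, D, ITERATIONS). Below is a self-contained, runnable sketch of the same loop; those names, the input format ("label x1 … xD" per line), and the use of plain Arrays in place of the slide's Vector type are all assumptions for illustration.

import org.apache.spark.{SparkConf, SparkContext}
import scala.math.exp
import scala.util.Random

object LogisticRegressionSketch {
  case class Point(x: Array[Double], y: Double)

  def main(args: Array[String]): Unit = {
    val D = 10           // number of features (assumed)
    val ITERATIONS = 5   // iteration count (assumed)
    val sc = new SparkContext(new SparkConf().setAppName("lr-sketch"))

    // Parse "label x1 x2 ... xD"; a function literal keeps the task closure serializable.
    val readPoint = (line: String) => {
      val t = line.split("\\s+").map(_.toDouble)
      Point(t.tail, t.head)
    }

    // Cached once, reused by every iteration -- the point of the example.
    val data = sc.textFile("hdfs://...").map(readPoint).cache()

    var w = Array.fill(D)(2 * Random.nextDouble() - 1)   // random initial plane
    for (_ <- 1 to ITERATIONS) {
      val wLocal = w   // capture a stable copy in the closure
      val gradient = data.map { p =>
        val margin = p.x.zip(wLocal).map { case (xi, wi) => xi * wi }.sum
        val scale  = (1.0 / (1.0 + exp(-p.y * margin)) - 1.0) * p.y
        p.x.map(_ * scale)
      }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
      w = w.zip(gradient).map { case (wi, gi) => wi - gi }
    }
    println("Final w: " + w.mkString(", "))
    sc.stop()
  }
}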

Logistic Regression Performance
[Chart: running time per iteration. Hadoop: 127 s / iteration. Spark: first iteration 174 s, further iterations 6 s]

Spark Applications
»In-memory data mining on Hive data (Conviva)
»Predictive analytics (Quantifind)
»City traffic prediction (Mobile Millennium)
»Twitter spam classification (Monarch)
»Collaborative filtering via matrix factorization
»…

Conviva GeoReport
Aggregations on many keys w/ same WHERE clause
[Chart: time (hours) per report, Hive vs Spark]
40× gain comes from:
»Not re-reading unused columns or filtered records
»Avoiding repeated decompression
»In-memory storage of deserialized objects
(An illustrative sketch of this pattern follows below.)
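[Editor's note] Illustrative only, not Conviva's actual pipeline: the schema, field layout, and filter below are assumptions. The pattern is to evaluate the shared WHERE clause once, cache the surviving records as deserialized objects, then run many aggregations against the in-memory set.

import org.apache.spark.{SparkConf, SparkContext}

object GeoReportSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("georeport-sketch"))

    // (country, city, sessionTimeSec) triples; CSV layout is an assumption.
    val sessions = sc.textFile("hdfs://...").map { line =>
      val f = line.split(',')
      (f(0), f(1), f(2).toDouble)
    }

    // Shared WHERE clause evaluated once; results kept deserialized in memory.
    val filtered = sessions.filter { case (country, _, _) => country == "US" }.cache()

    // Each aggregation reuses the cached records instead of re-scanning the input.
    val byCountry = filtered.map { case (c, _, t) => (c, t) }.reduceByKey(_ + _).collect()
    val byCity    = filtered.map { case (_, c, t) => (c, t) }.reduceByKey(_ + _).collect()
    val total     = filtered.map { case (_, _, t) => t }.reduce(_ + _)
    println(s"total session time: $total")
    sc.stop()
  }
}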

Frameworks Built on Spark
Pregel on Spark (Bagel):
»Google's message-passing model for graph computation
»200 lines of code
Hive on Spark (Shark):
»3000 lines of code
»Compatible with Apache Hive
»ML operators in Scala

Implementation
Runs on Apache Mesos to share resources with Hadoop & other apps
Can read from any Hadoop input source (e.g. HDFS)
No changes to Scala compiler
[Diagram: Spark, Hadoop, and MPI running side by side on Mesos across cluster nodes]

Spark Scheduler
»Dryad-like DAGs
»Pipelines functions within a stage
»Cache-aware work reuse & locality
»Partitioning-aware to avoid shuffles
[Diagram: a job DAG over RDDs A-G with groupBy, map, union, and join edges, cut into Stage 1, Stage 2, and Stage 3; shaded boxes mark cached data partitions]
(A sketch of the partitioning-aware behavior follows below.)
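[Editor's note] A sketch (assumed example, not from the deck) of how partitioning awareness avoids a shuffle: co-partitioning two pair RDDs with the same HashPartitioner lets the scheduler run their join locally within a stage.

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitioningSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partitioning-sketch"))
    val part = new HashPartitioner(8)

    // Both RDDs are hash-partitioned the same way and cached once.
    val visits = sc.textFile("hdfs://...")
      .map { line => val f = line.split('\t'); (f(0), f(1)) }   // (url, visitor)
      .partitionBy(part)
      .cache()
    val pageNames = sc.textFile("hdfs://...")
      .map { line => val f = line.split('\t'); (f(0), f(1)) }   // (url, page name)
      .partitionBy(part)
      .cache()

    // Matching keys already live in matching partitions, so this join is
    // pipelined within the stage instead of triggering a shuffle.
    val joined = visits.join(pageNames)
    println(joined.count())
    sc.stop()
  }
}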

Interactive Spark
Modified Scala interpreter to allow Spark to be used interactively from the command line
Required two changes:
»Modified wrapper code generation so that each line typed has references to objects for its dependencies
»Distribute generated classes over the network
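[Editor's note] What this enables in practice, as an illustrative transcript (prompt, variable names, and the result value are assumptions; in Spark's shell, sc is predefined). Each typed line is compiled into a wrapper class that workers fetch over the network before running the closure:

scala> val lines = sc.textFile("hdfs://...")
scala> val threshold = 10                            // a local variable typed at the prompt...
scala> lines.filter(_.length > threshold).count()    // ...captured by a closure run on the cluster
res0: Long = 42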

Demo

Conclusion
Spark provides a simple, efficient, and powerful programming model for a wide range of apps.
Download our open source release:

Related Work
DryadLINQ, FlumeJava:
»Similar “distributed collection” API, but cannot reuse datasets efficiently across queries
Relational databases:
»Lineage/provenance, logical logging, materialized views
GraphLab, Piccolo, BigTable, RAMCloud:
»Fine-grained writes similar to distributed shared memory
Iterative MapReduce (e.g. Twister, HaLoop):
»Implicit data sharing for a fixed computation pattern
Caching systems (e.g. Nectar):
»Store data in files, no explicit control over what is cached

Behavior with Not Enough RAM

Fault Recovery Results

Spark Operations
Transformations (define a new RDD):
»map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues
Actions (return a result to driver program):
»collect, reduce, count, save, lookupKey
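[Editor's note] A brief tour (assumed example, not from the deck) touching several of the operations above; the data and paths are placeholders.

import org.apache.spark.{SparkConf, SparkContext}

object OperationsTour {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("operations-tour"))

    val nums  = sc.parallelize(1 to 100)
    val pairs = nums.map(n => (n % 10, n))                    // transformation: map

    val evens   = nums.filter(_ % 2 == 0)                     // transformation: filter
    val sampled = nums.sample(withReplacement = false, 0.1)   // transformation: sample
    val sums    = pairs.reduceByKey(_ + _)                    // transformation: reduceByKey
    val sorted  = sums.sortByKey()                            // transformation: sortByKey
    val doubled = pairs.mapValues(_ * 2)                      // transformation: mapValues
    val both    = evens.union(sampled)                        // transformation: union
    val joined  = sums.join(doubled)                          // transformation: join

    // Actions run the job and return results to the driver (or save them).
    println(nums.reduce(_ + _))         // action: reduce -> 5050
    println(sorted.collect().toSeq)     // action: collect
    println(joined.count())             // action: count
    both.saveAsTextFile("hdfs://...")   // action: save
    sc.stop()
  }
}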