Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing


Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica
712 citations

Motivation
- MapReduce-based jobs are slow
- The only way to share data across jobs is through stable storage, which requires data replication and disk I/O
- Want to support iterative algorithms
- Want to support interactive data mining tools (e.g. ad-hoc search)

Existing approaches to large-scale distributed computation on clusters
- General-purpose: language-integrated "distributed dataset" APIs (MapReduce with its map / shuffle / reduce stages, DryadLINQ, CIEL), but they cannot share datasets efficiently across queries
- Specialized: models tailored to one domain (Pregel for Google-style graph computation, HaLoop for iterative Hadoop jobs), which cannot run arbitrary / ad-hoc queries

(Cont.) Caching systems
- Nectar: automatic expression caching, but over a distributed file system
- CIEL: no explicit control over cached data
- PacMan: memory cache for HDFS, but writes still go to the network / disk
Lineage: used to track the dependencies among tasks across a DAG of tasks

Resilient Distributed Datasets (RDDs)
- A restricted form of distributed shared memory
- Read-only / immutable, partitioned collections of records
- Built deterministically through coarse-grained operations (map, filter, join, etc.) from stable storage or from other RDDs
- User-controlled persistence
- User-controlled partitioning
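A minimal Scala sketch of these properties using the Spark RDD API; the SparkContext setup, HDFS path, field layout, and partition count are illustrative assumptions rather than details from the slides:

  import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
  import org.apache.spark.storage.StorageLevel

  // Local SparkContext for illustration; in the paper's setting this runs on a cluster.
  val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))

  // An RDD from stable storage (hypothetical path): an immutable, partitioned collection of lines.
  val lines = sc.textFile("hdfs://namenode/logs/events.txt")

  // New RDDs are derived only through coarse-grained, deterministic transformations.
  val errors = lines.filter(_.contains("ERROR"))
  val byHost = errors.map(line => (line.split("\t")(0), 1))

  // User-controlled partitioning and persistence.
  val counts = byHost
    .reduceByKey(new HashPartitioner(8), _ + _)  // hash-partition the result into 8 partitions
    .persist(StorageLevel.MEMORY_ONLY)           // keep the partitions in RAM for reuse

Later actions over counts reuse the cached partitions instead of re-reading the input file.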

Representing RDDs
- Because lineage records how each RDD was built, RDDs normally need no checkpointing: lost partitions can simply be recomputed
- Checkpointing remains available as an option for RDDs with long lineage chains

Spark programming interface
- Lazy operations: transformations are not executed until an action is called
- Operations on RDDs:
  - Transformations: build new RDDs
  - Actions: compute and output results
- Partitioning: control layout across nodes
- Persistence: control storage in RAM / on disk
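A short sketch of the lazy-evaluation contract, reusing the hypothetical SparkContext sc and log path from the previous sketch:

  // Transformations only record lineage; no cluster work happens yet.
  val lines   = sc.textFile("hdfs://namenode/logs/events.txt")
  val errors  = lines.filter(_.contains("ERROR"))
  val lengths = errors.map(_.length)

  // Persistence is likewise only a hint attached to the RDD.
  lengths.persist()

  // Only an action forces evaluation of the whole chain.
  val total = lengths.reduce(_ + _)  // computation actually starts here
  val n     = lengths.count()        // reuses the partitions persisted by the first action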

RDD on Spark

Example: Console log mining
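The slide presumably showed the paper's console log mining code; a close Scala paraphrase follows, where the HDFS path and the tab-separated field index are illustrative assumptions:

  // Load error lines from a log file in HDFS and keep them in memory for interactive queries.
  val lines  = sc.textFile("hdfs://namenode/logs/server.log")
  val errors = lines.filter(_.startsWith("ERROR"))
  errors.persist()

  // Count errors that mention MySQL.
  errors.filter(_.contains("MySQL")).count()

  // Return the time fields (assumed to be column 3 of tab-separated lines) of errors mentioning HDFS.
  errors.filter(_.contains("HDFS"))
        .map(_.split('\t')(3))
        .collect()

Because errors is persisted, later interactive queries hit the in-memory partitions; if a node fails, lost partitions are recomputed from the lineage (textFile, then filter).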

Example: Logistic regression
- A classification problem: search for a hyperplane w that separates the two classes
- Transform the text input into point objects once
- Repeated map and reduce steps compute the gradient on each iteration
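A Scala sketch of that loop, close to the paper's example but made self-contained; the input path, parsing format, iteration count, and plain-array vector arithmetic are assumptions for illustration:

  import scala.math.exp
  import scala.util.Random

  case class Point(x: Array[Double], y: Double)  // y is the label, +1.0 or -1.0

  // Hypothetical format: label followed by whitespace-separated features.
  def parsePoint(line: String): Point = {
    val nums = line.split(' ').map(_.toDouble)
    Point(nums.tail, nums.head)
  }

  // Parse the text into point objects once and keep them in memory across iterations.
  val points = sc.textFile("hdfs://namenode/data/points.txt").map(parsePoint).persist()
  var w = Array.fill(points.first().x.length)(Random.nextDouble())  // random initial hyperplane

  for (_ <- 1 to 100) {
    // One iteration = a map (per-point gradient) followed by a reduce (vector sum).
    val gradient = points.map { p =>
      val dot   = w.zip(p.x).map { case (wi, xi) => wi * xi }.sum
      val scale = (1.0 / (1.0 + exp(-p.y * dot)) - 1.0) * p.y
      p.x.map(_ * scale)
    }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
    w = w.zip(gradient).map { case (wi, gi) => wi - gi }
  }

Only the small vector w travels between driver and workers each iteration; the parsed points stay cached on the cluster.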

Example: PageRank
- Start each page with rank 1/N
- On each iteration, each page p sends a contribution of rank(p) / |neighbors(p)| to its neighbors, and every page's rank is then updated to the sum of the contributions it receives: rank(p) = Σ_{i links to p} rank(i) / |neighbors(i)|
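A Scala sketch of this algorithm using RDD joins, without the damping factor, matching the slide's simplified update rule; the input path, link-file format, partition count, and iteration count are illustrative assumptions:

  import org.apache.spark.HashPartitioner

  // links: (page, list of outgoing neighbors). Built once, hash-partitioned and persisted
  // so the per-iteration join with `ranks` does not reshuffle the link table.
  val links = sc.textFile("hdfs://namenode/data/links.txt")
    .map { line => val parts = line.split("\\s+"); (parts(0), parts(1)) }
    .distinct()
    .groupByKey()
    .partitionBy(new HashPartitioner(8))
    .persist()

  val n = links.count()
  var ranks = links.mapValues(_ => 1.0 / n)  // start every page at rank 1/N

  for (_ <- 1 to 10) {
    // Each page sends rank / |neighbors| to its neighbors; new rank = sum of received contributions.
    val contribs = links.join(ranks).values.flatMap {
      case (neighbors, rank) => neighbors.map(dest => (dest, rank / neighbors.size))
    }
    ranks = contribs.reduceByKey(_ + _)
  }

Because links keeps its hash partitioner, the join each iteration shuffles only the small ranks RDD, not the large link table.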

PageRank performance

RDDs versus distributed shared memory (DSM)
- RDDs are unsuitable for applications that make asynchronous fine-grained updates to shared state, e.g. the storage system for a web application or an incremental web crawler
- Lookup by key

Implementation in Spark
- Job scheduler: data locality is captured using delay scheduling
- Interpreter integration: class shipping and modified code generation let the interactive shell run against a cluster
- Memory management: persisted RDDs can live in memory or spill to disk; cached partitions are evicted with an LRU policy
- Support for checkpointing: useful for RDDs with long lineage graphs
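How a couple of these mechanisms surface in Spark's user-facing API; a sketch assuming the existing SparkContext sc, with illustrative paths:

  import org.apache.spark.storage.StorageLevel

  // Memory management: choose a storage level; under memory pressure Spark evicts
  // cached partitions in LRU fashion.
  val parsed = sc.textFile("hdfs://namenode/data/big.txt")
    .map(_.split(','))
    .persist(StorageLevel.MEMORY_AND_DISK)

  // Checkpointing: write an RDD with a long lineage chain to reliable storage so that
  // recovery does not have to replay the whole chain.
  sc.setCheckpointDir("hdfs://namenode/checkpoints")
  var chain = parsed
  for (_ <- 1 to 50) chain = chain.map(identity)  // stand-in for a long chain of transformations
  chain.checkpoint()
  chain.count()  // the first action materializes both the RDD and its checkpoint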

Evaluation
- RDDs are implemented in Spark, which runs on Mesos to share clusters with Hadoop
- Spark can read from any Hadoop input source (e.g. HDFS or HBase)
- The RDD abstraction could also be implemented on other cluster systems

Scalability on iterative ML applications
- Hadoop shows no improvement across successive iterations; each iteration pays job-launch overhead (e.g. heartbeat signalling)
- Spark's first iteration is slower because it must convert the text input into binary in-memory data and Java objects; later iterations run from memory

Understanding the speedup
- Reading the input from HDFS costs about 2 seconds
- The remaining 10-second difference breaks down as:
  - parsing text into binary format: ~7 s
  - converting binary records into Java objects: ~3 s

Failure recovery in RDDs
RDDs track the graph of transformations that built them (their lineage) and use it to rebuild lost partitions

Insufficient memory
Performance degrades gracefully as less of the working set fits in memory

User applications built on Spark
- In-memory analytics at Conviva: 40x speedup
- Traffic modeling: traffic prediction via EM (Mobile Millennium)
- Twitter spam classification (Monarch)
- DNA sequence analysis (SNAP)

Good
- RDDs offer a simple and efficient programming model
- Open-source, scalable implementation in Spark
- Performance improves up to the memory-bandwidth limit, which suits batch processing
Improvements
- Memory can leak if too many RDDs are loaded; garbage collection of unused RDDs should be built in
- Uses LRU eviction; better memory-replacement algorithms are possible
- Data locality is handled via partitioning / hashing and delay scheduling
- A hybrid system could handle fine-grained updates
- Lineage could also be used for debugging

Thank you!
