CSCI5570 Large Scale Data Processing Systems: Distributed Data Analytics Systems. Slide acknowledgement: modified from the slides of Matei Zaharia. James Cheng, CSE, CUHK.

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. NSDI 2012.

Motivation
MapReduce greatly simplified “big data” analysis on large, unreliable clusters. But as soon as it became popular, users wanted more:
– more complex, multi-stage applications (e.g., iterative machine learning and graph processing)
– more interactive ad-hoc queries
Response: specialized frameworks for some of these applications (e.g., Pregel for graph processing).

Motivation
Complex applications and interactive queries require reuse of intermediate results across multiple computations (common in iterative machine learning, data mining, and graph algorithms), so we need efficient primitives for data sharing.
MapReduce lacks an abstraction for leveraging distributed memory: the only way to share data across jobs is stable storage → slow!

Examples
[Figure: data sharing in MapReduce. An iterative job writes each iteration's output to HDFS and reads it back for the next; interactive queries each re-read the same input from HDFS to produce their results.]
Slow due to replication and disk I/O, but necessary for fault tolerance.

Goal: In-Memory Data Sharing
[Figure: the same workloads with intermediate data kept in memory; the input is processed once, and later iterations and queries reuse the in-memory data.]
Memory is 10-100× faster than network/disk, but how do we get fault tolerance?

Challenge
How do we design a distributed memory abstraction that is both fault-tolerant and efficient?

Challenge
Existing storage abstractions have interfaces based on fine-grained updates to mutable state:
– RAMCloud, databases, distributed shared memory, Piccolo
This requires replicating data or logs across nodes for fault tolerance:
– costly for data-intensive applications
– 10-100× slower than a memory write

Solution: Resilient Distributed Datasets (RDDs)
A restricted form of distributed shared memory:
– immutable, partitioned collections of records
– can only be built through coarse-grained deterministic transformations (map, filter, join, …)
Efficient fault recovery using lineage:
– log the transformations used to build a dataset (i.e., its lineage) rather than the actual data
– if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to recompute that partition
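To make lineage concrete, here is a minimal hedged sketch in Scala (assuming a running SparkContext named sc and a placeholder HDFS path): each coarse-grained transformation adds one step to the lineage graph, which toDebugString prints.

  // Hedged sketch: coarse-grained transformations build a lineage graph
  // that Spark can replay to recompute lost partitions.
  val data     = sc.textFile("hdfs://...")      // recovery root in stable storage
  val parsed   = data.map(_.split(","))         // deterministic, coarse-grained step
  val filtered = parsed.filter(_.length > 2)    // another lineage step
  println(filtered.toDebugString)               // prints the chain of RDDs built so far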

RDD Recovery
[Figure: the same iterative and interactive workloads as before; after a failure, only the lost partitions are rebuilt by re-running the one-time processing that produced them from the input.]

Generality of RDDs
Despite their restrictions, RDDs can express surprisingly many parallel algorithms:
– these naturally apply the same operation to many items
They unify many current programming models:
– data flow models: MapReduce, Dryad, SQL, …
– specialized models for iterative applications: BSP (Pregel), iterative MapReduce (HaLoop), bulk incremental, …
They also support new applications that these models don't.

What RDDs are good and not good for
Good for:
– batch applications that apply the same operation to all elements of a dataset
– RDDs can efficiently remember each transformation as one step in a lineage graph
– recovery of lost partitions without having to log large amounts of data
Not good for:
– applications that need asynchronous fine-grained updates to shared state

Tradeoff Space
[Figure: granularity of updates (fine to coarse) plotted against write throughput (low to high). Fine-grained, high-throughput systems such as K-V stores, databases, and RAMCloud are best for transactional workloads; coarse-grained systems such as HDFS and RDDs are best for batch workloads. Memory bandwidth and network bandwidth bound the high and low ends of the throughput axis.]

How to represent RDDs
Need to track lineage across a wide range of transformations, so use a graph-based representation with a common interface providing:
– a set of partitions: atomic pieces of the dataset
– a set of dependencies on parent RDDs
– a function for computing the dataset based on its parents
– metadata about its partitioning scheme and data placement
A sketch of this interface follows.
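Here is a hedged Scala sketch of that common interface, loosely modeled on the paper; the names are simplified illustrations rather than Spark's actual internal API, and Partition, Dependency, and Partitioner are stub types introduced only for this sketch.

  // Illustrative stubs standing in for Spark-internal types.
  trait Partition { def index: Int }
  trait Dependency[T]
  trait Partitioner { def numPartitions: Int; def getPartition(key: Any): Int }

  // The common RDD interface: everything lineage tracking needs.
  trait RDD[T] {
    def partitions: Seq[Partition]                         // atomic pieces of the dataset
    def dependencies: Seq[Dependency[_]]                   // parent RDDs this one derives from
    def compute(split: Partition): Iterator[T]             // derive one partition from its parents
    def partitioner: Option[Partitioner]                   // metadata: partitioning scheme
    def preferredLocations(split: Partition): Seq[String]  // metadata: data placement hints
  }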

RDD Dependencies
Narrow dependencies: each partition of the parent RDD is used by at most one partition of the child RDD.
Good for pipelined execution on one cluster node, e.g., applying a map followed by a filter on an element-by-element basis. Recovery is efficient, as only the lost parent partitions need to be recomputed (in parallel on different nodes).

RDD Dependencies
Wide dependencies: each partition of the parent RDD is used by multiple child partitions.
These require data from all parent partitions to be available and to be shuffled across the nodes using a MapReduce-like operation. Recovery is expensive, as a complete re-execution is needed. The sketch below contrasts the two kinds.
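A small hedged Scala contrast of narrow versus wide dependencies (assuming a SparkContext sc; the data and partition counts are made up for illustration):

  val nums    = sc.parallelize(1 to 1000000, 8)  // base RDD with 8 partitions
  val odds    = nums.filter(_ % 2 == 1)          // narrow: pipelined within each partition
  val keyed   = odds.map(n => (n % 10, n))       // narrow: still one parent partition per child
  val grouped = keyed.groupByKey()               // wide: data is shuffled across the nodes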

Outline
– Spark programming interface
– Implementation
– How people are using Spark

Spark Programming Interface
A DryadLINQ-like API in the Scala language, used interactively from the Scala interpreter. Provides:
– resilient distributed datasets (RDDs)
– operations on RDDs: transformations and actions
– control of each RDD's partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc.)

Spark Programming Interface
Transformations – lazy operations that define new RDDs.
Actions – launch a computation to return a value to the program or write data to external storage.
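A minimal hedged illustration of this laziness (assuming a SparkContext sc):

  val squares = sc.parallelize(1 to 100).map(x => x * x)  // transformation: defines an RDD, runs nothing
  val total   = squares.reduce(_ + _)                     // action: triggers the actual computation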

Spark Operations
Transformations (define a new RDD): map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues
Actions (return a result to the driver program): collect, reduce, count, save, lookupKey

Spark Operations
[Table: the same transformations and actions with their full type signatures.]
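To make the operations concrete, here is a hedged word-count sketch combining several of them (assuming a SparkContext sc; the HDFS path is a placeholder):

  val counts = sc.textFile("hdfs://...")
    .flatMap(line => line.split(" "))   // transformation: one record per word
    .map(word => (word, 1))             // transformation: (word, count) pairs
    .reduceByKey(_ + _)                 // transformation: sum the counts per word
  counts.collect().foreach(println)     // action: bring the results to the driver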

Programming in Spark
Programmers start by defining one or more RDDs through transformations on data in stable storage (e.g., map, filter). The RDDs can then be used in actions (e.g., count, collect, save). Spark computes RDDs lazily, the first time they are used in an action.

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

  lines = spark.textFile("hdfs://...")          // base RDD
  errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD
  messages = errors.map(_.split('\t')(2))
  messages.persist()

  messages.filter(_.contains("foo")).count      // action
  messages.filter(_.contains("bar")).count      // action

[Figure: the master ships tasks to workers; each worker caches its block of messages (Msgs. 1-3) in RAM and returns results.]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data).

Fault Recovery
RDDs track the graph of transformations that built them (their lineage) to rebuild lost data. E.g.:

  messages = textFile(...).filter(_.contains("error"))
                          .map(_.split('\t')(2))

[Figure: the lineage chain HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…)).]

Fault Recovery Results
[Figure: per-iteration running times; the iteration in which a failure happens slows down only briefly while the lost partitions are recomputed from lineage in parallel.]

Example: PageRank
1. Start each page with a rank of 1.
2. On each iteration, update each page's rank to Σ_{i ∈ neighbors} rank_i / |neighbors_i|

  links = // RDD of (url, neighbors) pairs
  ranks = // RDD of (url, rank) pairs
  for (i <- 1 to ITERATIONS) {
    ranks = links.join(ranks).flatMap {
      case (url, (links, rank)) =>
        links.map(dest => (dest, rank / links.size))
    }.reduceByKey(_ + _)
  }
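A self-contained hedged version of the same loop (assuming a SparkContext sc; the three-page link graph and the iteration count are made-up illustrations, and damping is omitted to match the simplified update rule above):

  // Tiny illustrative link graph of (url, neighbors) pairs.
  val links = sc.parallelize(Seq(
    ("a", Seq("b", "c")), ("b", Seq("c")), ("c", Seq("a"))
  )).persist()                                 // reused every iteration, so keep it in RAM

  var ranks = links.mapValues(_ => 1.0)        // step 1: every page starts at rank 1

  for (i <- 1 to 10) {                         // step 2: spread rank along outgoing links
    val contribs = links.join(ranks).flatMap {
      case (_, (neighbors, rank)) =>
        neighbors.map(dest => (dest, rank / neighbors.size))
    }
    ranks = contribs.reduceByKey(_ + _)
  }
  ranks.collect().foreach(println)             // action: materialize the final ranks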

Optimizing Placement
links and ranks are repeatedly joined. We can co-partition them (e.g., hash both on URL) to avoid shuffles, and can also use application knowledge, e.g., hash on DNS name:

  links = links.partitionBy(new URLPartitioner())

[Figure: with co-partitioned Links (url, neighbors) and Ranks (url, rank), each join and reduce produces the Contribs and next Ranks without moving links across the network.]
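A hedged sketch of the co-partitioning with Spark's built-in HashPartitioner (URLPartitioner above is the slide's app-specific partitioner, whose definition is not shown; rawLinks is an assumed input RDD of (url, neighbors) pairs):

  import org.apache.spark.HashPartitioner

  val part  = new HashPartitioner(100)              // 100 partitions, hashed on the key (URL)
  val links = rawLinks.partitionBy(part).persist()  // partition links once and keep them in RAM
  var ranks = links.mapValues(_ => 1.0)             // mapValues preserves the partitioner
  // links.join(ranks) now matches partitions locally: no shuffle is needed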

PageRank Performance
[Figure: per-iteration runtime of PageRank on Hadoop, on basic Spark, and on Spark with controlled partitioning.]

Behavior with Insufficient RAM
[Figure: iteration time as less of the working set fits in memory; performance degrades gracefully rather than falling off a cliff.]

Scalability
[Figure: near-linear speedup for Logistic Regression and K-Means as machines are added.]

Outline
– Spark programming interface
– Implementation
– How people are using Spark

Implementation
Spark runs on the Mesos cluster manager [NSDI '11] to share clusters with Hadoop, MPI, etc. It can read from any Hadoop input source (HDFS, S3, …).
[Figure: the software stack: Spark, Hadoop, and MPI running side by side on Mesos across the cluster's nodes.]

Task Scheduler
Builds Dryad-like DAGs; pipelines functions within a stage; locality- and data-reuse-aware; partitioning-aware to avoid shuffles.
[Figure: an RDD graph with map, groupBy, union, and join operators cut into three stages at wide dependencies; already-cached data partitions need not be recomputed.]

Programming Models Implemented on Spark
RDDs can express many existing parallel models:
– MapReduce, DryadLINQ
– Pregel graph processing [200 LOC]
– iterative MapReduce [200 LOC]
– SQL: Hive on Spark (Shark)
This enables applications to efficiently intermix these models. All are based on coarse-grained operations.

Outline
– Spark programming interface
– Implementation
– How people are using Spark

Open Source Community
User applications:
– data mining 40x faster than Hadoop (Conviva)
– exploratory log analysis (Foursquare)
– traffic prediction via EM (Mobile Millennium)
– Twitter spam classification (Monarch)
– DNA sequence analysis (SNAP)
– ...

Conclusion
RDDs offer a simple and efficient programming model for a broad range of applications. They leverage the coarse-grained nature of many parallel algorithms for low-overhead recovery.