Spark: Cluster Computing with Working Sets

Spark: Cluster Computing with Working Sets -- Aaron, 2013/03/28

Overview
- Motivation
- Spark
- Resilient Distributed Datasets (RDD)
- Shared Variables
- Experiments

Motivation

Hadoop
Advantages: scalability, fault tolerance.
Disadvantage: acyclic data flow, not suitable for iterative algorithms.

Disadvantages of Hadoop (1)
Iterative jobs: many common machine learning algorithms apply a function repeatedly to the same dataset to optimize a parameter.
Interactive analytics: when querying large datasets, a user would like to be able to load a dataset of interest into memory across a number of machines and query it repeatedly.

Disadvantages of Hadoop (2)
In most current frameworks, the only way to reuse data between computations (e.g., between two MapReduce jobs) is to write it to an external stable storage system, e.g., a distributed file system. This incurs substantial overheads due to data replication, disk I/O, and serialization, which can dominate application execution times.

Spark

Programming Model
- Driver program: implements the high-level control flow of the application and launches various operations in parallel (a minimal sketch follows below).
- Resilient Distributed Datasets (RDDs): read-only collections of objects partitioned across the cluster.
- Parallel operations: invoked by passing a function to apply on a dataset.
- Shared variables: variables that can be used in functions running on the cluster.
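A minimal sketch of this model, assuming modern Spark package names, a local master URL, and a placeholder HDFS path (the paper's prototype used slightly different APIs):

    import org.apache.spark.SparkContext

    // Driver program: builds RDDs, launches parallel operations, receives results.
    val sc = new SparkContext("local[4]", "SparkExample")   // master URL and app name are placeholders

    val lines   = sc.textFile("hdfs://...")                 // RDD backed by stable storage (path elided)
    val lengths = lines.map(_.length)                       // parallel operation: a closure shipped to workers
    val total   = lengths.reduce(_ + _)                     // result returned to the driver
    println("Total characters: " + total)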

Framework of Spark

Resilient Distributed Datasets (RDD)

Resilient Distributed Datasets
A resilient distributed dataset (RDD) is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. The elements of an RDD need not exist in physical storage; instead, a handle to an RDD contains enough information to compute the RDD starting from data in reliable storage. This means that RDDs can always be reconstructed if nodes fail.

Changing the Persistence of an RDD
By default, RDDs are lazy and ephemeral. Users can alter the persistence of an RDD through two actions:
- Cache action: hints that the RDD should be kept in memory after the first time it is computed, because it will be reused.
- Save action: evaluates the dataset and writes it to a distributed file system such as HDFS.
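A sketch of the two persistence actions, assuming an existing SparkContext named sc; in current Spark the save action corresponds to methods such as saveAsTextFile, and paths are placeholders:

    // sc: an existing SparkContext (see the driver sketch above)
    val raw     = sc.textFile("hdfs://.../input")     // lazy and ephemeral by default
    val cleaned = raw.filter(_.nonEmpty)

    cleaned.cache()                                   // cache action: hint to keep in memory after first use
    cleaned.saveAsTextFile("hdfs://.../output")       // save action: evaluates and writes to HDFS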

Operations on RDDs (1)
Transformations (create an RDD): operations on either (1) data in stable storage or (2) other RDDs. Examples of transformations include map, filter, and join.

Operations on RDDs (2)
Actions: operations that return a value to the application or export data to a storage system. Examples of actions include count, collect, and save.
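A small sketch contrasting the two kinds of operations (illustrative data; an existing SparkContext sc and a placeholder output path are assumed):

    // Transformations are lazy: they only define new RDDs.
    val nums    = sc.parallelize(1 to 1000)
    val squares = nums.map(n => n * n)
    val evens   = squares.filter(_ % 2 == 0)

    // Actions trigger computation and return values or write data out.
    val howMany = evens.count()                       // value returned to the application
    val sample  = evens.collect()                     // all elements shipped to the driver
    evens.saveAsTextFile("hdfs://.../evens")          // export to a storage system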

Operations on RDDs (3)
Programmers can call a persist method to indicate which RDDs they want to reuse in future operations. Spark keeps persistent RDDs in memory by default, but it can spill them to disk if there is not enough RAM. Users can set a persistence priority on each RDD to specify which in-memory data should spill to disk first.

Example 1 of RDD
The datasets in this example are stored as a chain of objects capturing the lineage of each RDD. Each dataset object contains a pointer to its parent and information about how the parent was transformed.
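The slide's example code, reconstructed from the paper's text-search (error-log mining) example; the HDFS path is a placeholder, and spark denotes the paper's SparkContext-style handle:

    // spark: the SparkContext-like handle used in the paper
    val file  = spark.textFile("hdfs://...")            // dataset backed by an HDFS file
    val errs  = file.filter(_.contains("ERROR"))        // child RDD, remembers its parent and the predicate
    val ones  = errs.map(_ => 1)                        // one element per error line
    val count = ones.reduce(_ + _)                      // lineage: file -> errs -> ones -> count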

Lineage Chain of Example 1

Example 2 of RDD

Lineage Chain of Example 2

RDD Objects
Each RDD object implements the same simple interface, which consists of five pieces of information:

Examples of RDD Operations: HDFS files (lines)
- partitions(): returns one partition for each block of the file (with the block's offset stored in each Partition object).
- preferredLocations(): gives the nodes that a block is on.
- iterator(): reads the block.
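An illustrative Scala sketch of that common interface; the type and method names below are simplified placeholders, not the exact internal Spark API:

    // Placeholder types for the sketch
    trait Partition { def index: Int }
    trait Dependency
    trait Partitioner

    abstract class SimpleRDD[T] {
      def partitions: Seq[Partition]                          // set of partitions (e.g., one per HDFS block)
      def dependencies: Seq[Dependency]                       // dependencies on parent RDDs
      def iterator(split: Partition): Iterator[T]             // computes the elements of one partition
      def preferredLocations(split: Partition): Seq[String]   // locality hints (e.g., nodes holding the block)
      def partitioner: Option[Partitioner]                    // metadata about the partitioning scheme, if any
    }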

Dependencies between RDDs (1)
Narrow dependencies: each partition of the parent RDD is used by at most one partition of the child RDD (1:1). map leads to a narrow dependency.
Wide dependencies: multiple child partitions may depend on a single partition of the parent RDD (1:N). join leads to wide dependencies.

Dependencies between RDDs (2)
Narrow dependencies allow pipelined execution on one cluster node, which can compute all the parent partitions. For example, one can apply a map followed by a filter on an element-by-element basis. Wide dependencies require data from all parent partitions to be available and shuffled across the nodes using a MapReduce-like operation. Recovery after a node failure is also more efficient with narrow dependencies than with wide ones.
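A sketch of operations that typically produce narrow versus wide dependencies (illustrative key-value data; an existing SparkContext sc is assumed):

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // Narrow dependency: each child partition reads from at most one parent partition,
    // so these steps can be pipelined on the same node.
    val doubled  = pairs.mapValues(_ * 2)
    val nonEmpty = doubled.filter { case (_, v) => v > 0 }

    // Wide dependency: child partitions need data from many parent partitions,
    // so records are shuffled across the cluster.
    val grouped = pairs.groupByKey()
    val joined  = pairs.join(doubled)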

Parallel Operations
- reduce: combines dataset elements using an associative function to produce a result at the driver program.
- collect: sends all elements of the dataset to the driver program.
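A short usage sketch of these two operations (existing SparkContext sc assumed):

    val nums = sc.parallelize(1 to 100, 4)     // 4 partitions, processed in parallel

    // reduce: the function must be associative so partial results can be combined per partition.
    val sum = nums.reduce(_ + _)

    // collect: ships every element back to the driver; only sensible for small results.
    val everything: Array[Int] = nums.collect()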

Shared Variables

Shared Variables
Programmers invoke operations like map, filter, and reduce by passing closures (functions) to Spark. These closures can refer to variables in the scope where they are created; normally, when Spark runs a closure on a worker node, these variables are copied to the worker. However, Spark also lets programmers create two restricted types of shared variables to support two simple but common usage patterns.

Broadcast Variables
When one creates a broadcast variable b with a value v, v is saved to a file in a shared file system. The serialized form of b is a path to this file. When b’s value is queried on a worker node, Spark first checks whether v is in a local cache, and reads it from the file system if it isn’t.
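A usage sketch: a large read-only lookup table is shipped to each node once instead of with every closure (illustrative data; existing SparkContext sc assumed):

    val lookup = Map("a" -> 1, "b" -> 2, "c" -> 3)
    val table  = sc.broadcast(lookup)

    val words = sc.parallelize(Seq("a", "b", "c", "a"))
    val codes = words.map(w => table.value.getOrElse(w, -1))   // workers read the locally cached value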

Accumulators
Each accumulator is given a unique ID when it is created. When the accumulator is saved, its serialized form contains its ID and the “zero” value for its type. On the workers, a separate copy of the accumulator is created for each thread that runs a task using thread-local variables, and is reset to zero when a task begins. After each task runs, the worker sends a message to the driver program containing the updates it made to various accumulators.
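A usage sketch, assuming an existing SparkContext sc and the classic accumulator API from the paper era; the input path is a placeholder:

    val badRecords = sc.accumulator(0)                 // the "zero" value fixes the accumulator's type

    val lines = sc.textFile("hdfs://.../input")
    lines.foreach { line =>
      if (line.isEmpty) badRecords += 1                // each task's updates are sent back to the driver
    }
    println("Empty lines: " + badRecords.value)        // only the driver can read the value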

Experiments

Results (1)
- Dataset: 29 GB
- Environment: 20 “m1.xlarge” EC2 nodes with 4 cores each
- Algorithm: logistic regression
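A sketch modeled on the logistic regression code in the paper; parsePoint, Vector, D, and ITERATIONS are assumed helpers and constants rather than real library calls, the input path is a placeholder, and an existing SparkContext sc is assumed:

    import scala.math.exp

    // Points are parsed once and cached in memory, then reused across iterations.
    val points = sc.textFile("hdfs://.../points").map(parsePoint).cache()

    var w = Vector.random(D)                     // random initial plane (Vector is an assumed dense-vector helper)
    for (i <- 1 to ITERATIONS) {
      val gradient = points.map { p =>
        p.x * ((1.0 / (1.0 + exp(-p.y * (w dot p.x))) - 1.0) * p.y)
      }.reduce(_ + _)                            // gradients combined on the cluster, returned to the driver
      w -= gradient                              // only w is shipped back to the workers next iteration
    }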

Results (2)

Results (3)

Thank You!!!