Spark.


Spark

Outline Motivating Example; History; Spark Stack; MapReduce Revisited; Linear Regression Using MapReduce; Iterative and Interactive Procedures in Hadoop; In-Memory Storage; Resilient Distributed Datasets

Outline Operations on RDDs: Transformations and Actions; Lazy Evaluation; Lineage Graph; Distributed Execution; Caching; Broadcast Variables; Logistic Regression: Implementation and Performance

Motivating Example The Daytona GraySort benchmarking challenge: Spark processed 100 terabytes of data stored on solid-state drives in 23 minutes, while the previous winner, which used Hadoop, took 72 minutes. This was a static dataset; the performance advantage is even higher for interactive jobs.

History Started in 2009 at the AMPLab at the University of California, Berkeley, with the idea of building a cluster computing system different from existing ones (Hadoop), focused on interactive and iterative computations: the cases where Hadoop is slow. Open sourced in 2010 under the BSD license. In 2013 the project was donated to the Apache Software Foundation and is now licensed under Apache 2.0.

Spark Stack Spark Core, with Spark SQL, Spark Streaming, MLlib and GraphX on top, running on the Standalone Scheduler, YARN or Mesos as cluster managers.


MapReduce Revisited Hadoop is based on acyclic data flow from stable storage to stable storage. (Diagram: data is read from stable storage, passes through mappers and reducers, and is written back to stable storage.)

Linear Regression Using MapReduce $Y = Xw$, where $X$ is the $n \times d$ data matrix, $Y$ is the $n \times 1$ vector of targets, and $w$ is the $d \times 1$ weight vector.

Linear Regression Using MapReduce $w = (X^T X)^{-1} (X^T Y) = \left(\sum_{i=1}^{n} x_i x_i^T\right)^{-1} \left(\sum_{i=1}^{n} x_i y_i\right)$. The two sums are computed by two MapReduce jobs, $m_1$ and $m_2$.

Linear Regression Using MapReduce $w = \left(\sum_{i=1}^{n} x_i x_i^T\right)^{-1} \left(\sum_{i=1}^{n} x_i y_i\right)$

Mapper 1 / Reducer 1 (compute $\sum_i x_i x_i^T$):

Map(record) {
  x = getXFromRecord(record);
  emit("xx", x * transpose(x));
}

Reduce(key, values) {
  emit(key, sum(values));
}

Linear Regression Using MapReduce $w = \left(\sum_{i=1}^{n} x_i x_i^T\right)^{-1} \left(\sum_{i=1}^{n} x_i y_i\right)$

Mapper 2 / Reducer 2 (compute $\sum_i x_i y_i$):

Map(record) {
  x = getXFromRecord(record);
  y = getYFromRecord(record);
  emit("xy", x * y);
}

Reduce(key, values) {
  emit(key, sum(values));
}
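To make the two jobs concrete, here is a minimal plain-Python sketch of the same computation; the toy records and their space-separated "features then target" format are assumptions for illustration, not part of the slides.

from functools import reduce
import numpy as np

def parse(record):
    vals = [float(v) for v in record.split()]
    return np.array(vals[:-1]), vals[-1]          # (x_i, y_i)

records = ["1 2 5", "2 0 3", "0 1 2"]             # toy dataset (assumed)
points = [parse(r) for r in records]

# Job 1: sum_i x_i x_i^T (Mapper 1 emits outer products, Reducer 1 sums them)
xtx = reduce(lambda a, b: a + b, map(lambda p: np.outer(p[0], p[0]), points))

# Job 2: sum_i x_i y_i (Mapper 2 emits x_i * y_i, Reducer 2 sums them)
xty = reduce(lambda a, b: a + b, map(lambda p: p[0] * p[1], points))

w = np.linalg.solve(xtx, xty)                     # w = (X^T X)^{-1} (X^T Y)
print(w)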

Iterative Procedure in Hadoop (Diagram: each iteration is a chain of MapReduce jobs MR1, MR2, MR3 that starts with an HDFS read of data on disk and ends with an HDFS write back to disk, so Iteration 2 must re-read from disk the data that Iteration 1 just wrote.)

Interactive Procedure in Hadoop (Diagram: Query 1, Query 2 and Query 3 each start with their own HDFS read of the data on disk before producing their result.)

Solution: Keep Working Set in Memory (Iterative Algorithms) (Diagram: the data is read from disk via HDFS once for the first iteration; intermediate results stay in distributed memory between iterations 1 through n, and only the final output is written back to disk.)

Solution: Keep Working Set in Memory (Interactive Procedures) (Diagram: the data is loaded from disk into distributed memory once; Query 1, Query 2 and Query 3 are then answered from memory.)

Challenge How to design a distributed memory abstraction that is both efficient and fault tolerant?

Resilient Distributed Datasets (RDDs) A distributed collection of objects. In Spark, all work is done using RDDs. Spark (Core) automatically distributes the data contained in RDDs across the cluster, and the operations performed on RDDs are parallelized across the partitions.

Properties of RDDs In-memory, immutable, lazily evaluated, parallelized and partitioned. RDDs can hold any type of Scala, Java or Python object, including user-defined ones.

Creating RDDs From a file: rdd = sc.textFile("file.txt") From an in-memory collection, using parallelize(): rdd = sc.parallelize([1, 2, 3])
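The snippets on these slides assume an existing SparkContext named sc. As a minimal setup sketch, a standalone PySpark script could create it like this (the app name and local master URL are placeholder assumptions):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("RDDExamples").setMaster("local[*]")
sc = SparkContext(conf=conf)

file_rdd = sc.textFile("file.txt")      # RDD of the lines of a text file
list_rdd = sc.parallelize([1, 2, 3])    # RDD from an in-memory collection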

Types of Operations Transformations: lazy operations that return another RDD. Actions: operations that trigger computation and return values to the driver program.

Transformations The operations below are applied to an RDD with the contents {3, 4, 1, 3}.

map: returns a new RDD formed by passing each element of the source through a function.
  Syntax: rdd.map(lambda x: x + 1)  Result: {4, 5, 2, 4}
filter: returns a new RDD formed by selecting those elements of the source on which the function returns true.
  Syntax: rdd.filter(lambda x: x % 3 != 0)  Result: {4, 1}
flatMap: applies a function to each element in the RDD and returns an RDD of the contents of the iterators returned.
  Syntax: rdd.flatMap(lambda x: range(x))  Result: {0, 1, 2, 0, 1, 2, 3, 0, 0, 1, 2}
distinct: returns a new RDD that contains the distinct elements of the source dataset.
  Syntax: rdd.distinct()  Result: {3, 4, 1}
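A small runnable check of the table above, assuming the SparkContext sc from earlier; collect() brings the results back as a Python list.

rdd = sc.parallelize([3, 4, 1, 3])
print(rdd.map(lambda x: x + 1).collect())          # [4, 5, 2, 4]
print(rdd.filter(lambda x: x % 3 != 0).collect())  # [4, 1]
print(rdd.flatMap(lambda x: range(x)).collect())   # [0, 1, 2, 0, 1, 2, 3, 0, 0, 1, 2]
print(rdd.distinct().collect())                    # [3, 4, 1] (order may vary)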

Transformations The operations below are applied to a pair RDD with the contents {("a", 2), ("b", 1), ("a", 1), ("c", 2), ("b", 3), ("a", 2), ("c", 4)}.

groupByKey: when called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
  Syntax: rdd.groupByKey()  Result: {("a", {2, 1, 2}), ("b", {1, 3}), ("c", {2, 4})}
reduceByKey: when called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduction function.
  Syntax: rdd.reduceByKey(lambda a, b: a + b)  Result: {("a", 5), ("b", 4), ("c", 6)}
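The same pair-RDD operations as a runnable sketch; groupByKey returns an iterable per key, so mapValues(list) is used here only to make the output printable.

pairs = sc.parallelize([("a", 2), ("b", 1), ("a", 1), ("c", 2),
                        ("b", 3), ("a", 2), ("c", 4)])

grouped = pairs.groupByKey().mapValues(list)
print(grouped.collect())    # e.g. [('a', [2, 1, 2]), ('b', [1, 3]), ('c', [2, 4])]

summed = pairs.reduceByKey(lambda a, b: a + b)
print(summed.collect())     # e.g. [('a', 5), ('b', 4), ('c', 6)]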

Transformations The operations below are applied to two RDDs, first = {1, 2, 3} and second = {2, 4, 5}.

union: returns a new RDD that contains the union of the elements in the source dataset and the argument.
  Syntax: first.union(second)  Result: {1, 2, 3, 2, 4, 5}
intersection: returns a new RDD that contains the intersection of the elements in the source dataset and the argument.
  Syntax: first.intersection(second)  Result: {2}
subtract: returns a new RDD that contains the elements of the source dataset that are not present in the argument.
  Syntax: first.subtract(second)  Result: {1, 3}
cartesian: when called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).
  Syntax: first.cartesian(second)  Result: {(1, 2), (1, 4), (1, 5), (2, 2), (2, 4), (2, 5), (3, 2), (3, 4), (3, 5)}
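And the set-like transformations, again assuming the SparkContext sc; note that the order of elements in the results is not guaranteed.

first = sc.parallelize([1, 2, 3])
second = sc.parallelize([2, 4, 5])

print(first.union(second).collect())         # [1, 2, 3, 2, 4, 5]
print(first.intersection(second).collect())  # [2]
print(first.subtract(second).collect())      # [1, 3]
print(first.cartesian(second).collect())     # all nine (x, y) pairs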

Actions The operations below are applied to an RDD with the contents {3, 4, 1, 3}.

reduce: aggregates the elements of the RDD using a function.
  Syntax: rdd.reduce(lambda a, b: a * b)  Result: 36
collect: returns all the elements of the RDD as an array at the driver program.
  Syntax: rdd.collect()  Result: [3, 4, 1, 3]
count: returns the number of elements in the RDD.
  Syntax: rdd.count()  Result: 4
first: returns the first element of the RDD.
  Syntax: rdd.first()  Result: 3
take: returns an array with the first n elements of the dataset.
  Syntax: rdd.take(2)  Result: [3, 4]
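A quick check of the actions on the same RDD, assuming the SparkContext sc.

rdd = sc.parallelize([3, 4, 1, 3])

print(rdd.reduce(lambda a, b: a * b))  # 36
print(rdd.collect())                   # [3, 4, 1, 3]
print(rdd.count())                     # 4
print(rdd.first())                     # 3
print(rdd.take(2))                     # [3, 4]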

Question There is a function defined as:

def operation(rdd):
    result = rdd.map(lambda x: (x, 1.0)) \
                .reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    return result

If the input to the function 'operation' is the RDD {1, 2, 1}, what is the value returned by it?
a. 4, 1.0  b. 3, 3.0  c. 4, 3.0  d. 3, 1.0

Example: Word Count

def count_words(some_file):
    rdd = sc.textFile(some_file)
    return (rdd.flatMap(lambda x: x.split())
               .map(lambda x: (x, 1))
               .reduceByKey(lambda a, b: a + b))
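A possible way to run it (the file name here is a placeholder):

counts = count_words("some_file.txt")
for word, n in counts.collect():
    print(word, n)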

Lazy Evaluation Computation and loading of data happen only when an action is called; Spark internally records metadata to indicate that the operation has been requested. A quick example, finding the first line containing the word 'Python':

lines = sc.textFile("UsePython.txt")
pythonLines = lines.filter(lambda line: "Python" in line)
pythonLines.first()   # this is where Spark loads the data and carries out the operations

Question Consider the following steps:

1. data = sc.textFile('text.txt')
2. lens = data.map(lambda x: len(x))
3. addup = lens.reduce(lambda a, b: a + b)
4. addup.collect()

Given the flow above, at which point in time is the RDD 'data' loaded into memory? 1 2 3 4

Lineage Graph (Diagram: an RDD is created from a file as inputRdd; map produces lengthsRdd, filter produces subsetRdd, map produces squaresRdd, reduceByKey produces pairsRdd, and distinct produces distinctRdd.)

Lineage Graph (Diagram: a partition of pairsRdd becomes corrupted or lost.)

Lineage Graph (Diagram: Spark uses the lineage graph to recompute only the lost partition from its ancestor RDDs, restoring pairsRdd without rerunning the whole job.)

Components for Distributed Execution (Diagram: a Driver Program hosting the SparkContext sends tasks to Executors running on Worker Nodes.)

Caching We might want to use the same RDD multiple times, but Spark will re-compute the RDD and all of its dependencies each time we call an action on it. For example:

inputRdd = sc.textFile("file.txt")
lengths = inputRdd.map(lambda x: len(x))
squares = lengths.map(lambda x: x * x)
squares.reduce(lambda a, b: a + b)   # recomputes the whole chain
squares.count()                      # recomputes the whole chain again

This is expensive for iterative algorithms. If we wish to use an RDD multiple times, we can cache it.

Caching (Diagram: the Driver sends tasks to the Workers; each Worker keeps its partitions of the cached RDD, Block 1, Block 2 and Block 3, in its local cache.)

inputRdd = sc.textFile("file.txt")
lengths = inputRdd.map(lambda x: len(x))
squares = lengths.map(lambda x: x * x)
squares.cache()
squares.reduce(lambda a, b: a + b)   # first action: computes and caches the partitions
squares.count()                      # answered from the cached partitions

Broadcast Variables Broadcast variables allow the program to efficiently send a large, read-only value to all the worker nodes.

def lookup(value, table):
    return table[value]

baseRdd = sc.textFile("text.txt")
lookupTable = loadLookupTable()      # large table, shipped with every task
(baseRdd.map(lambda x: lookup(x, lookupTable))
        .reduce(lambda a, b: a + b))

Broadcast Variables With a broadcast variable, the table is sent to each worker only once and tasks read it through its value:

def lookup(value, table):
    return table[value]

baseRdd = sc.textFile("text.txt")
lookupTable = sc.broadcast(loadLookupTable())      # broadcast the table to the workers once
(baseRdd.map(lambda x: lookup(x, lookupTable.value))
        .reduce(lambda a, b: a + b))

Example: Logistic Regression $w_j := w_j - \alpha \sum_{i=1}^{n} \left( h_w(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$, with $h_w(x) = \frac{1}{1 + e^{-w^T x}}$

val data = spark.textFile(...).map(readPoint).cache()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-(w dot p.x))) - p.y) * p.x
  ).reduce((a, b) => a + b)
  w -= ALPHA * gradient
}
println("w: " + w)
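For comparison, a rough PySpark equivalent of the Scala program above; the input file name, the feature dimension D, the learning rate ALPHA and the "features then 0/1 label" record format are assumptions for illustration.

import numpy as np

D = 10             # number of features (assumed)
ALPHA = 0.01       # learning rate (assumed)
ITERATIONS = 10

def read_point(line):
    vals = [float(v) for v in line.split()]
    return np.array(vals[:D]), vals[D]            # (x, y)

data = sc.textFile("points.txt").map(read_point).cache()
w = np.random.rand(D)                             # random initial weights

for _ in range(ITERATIONS):
    # gradient = sum_i (h_w(x_i) - y_i) * x_i, computed as a map followed by a reduce
    gradient = data.map(
        lambda p: (1.0 / (1.0 + np.exp(-np.dot(w, p[0]))) - p[1]) * p[0]
    ).reduce(lambda a, b: a + b)
    w -= ALPHA * gradient

print("w:", w)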

Logistic Regression Performance Hadoop: 127 s per iteration. Spark: 174 s for the first iteration (while the data is loaded into memory), about 6 s for each further iteration.

THANK YOU