Spark
Outline: Motivating Example; History; Spark Stack; MapReduce Revisited; Linear Regression Using MapReduce; Iterative and Interactive Procedures in Hadoop; In-Memory Storage; Resilient Distributed Datasets
Outline: Operations on RDDs (Transformations, Actions); Lazy Evaluation; Lineage Graph; Distributed Execution; Caching; Broadcast Variables; Logistic Regression: Implementation and Performance
Motivating Example: the Daytona GraySort benchmarking challenge. Spark processed 100 terabytes of data stored on solid-state drives in 23 minutes; the previous winner, using Hadoop, took 72 minutes. This was a static dataset; the gains are even larger for interactive jobs.
History: Started in 2009 at the AMPLab at the University of California, Berkeley. The idea was to build a cluster computing system different from existing ones (Hadoop), focused on interactive and iterative computations: the cases where Hadoop is slow. Open sourced in 2010 under the BSD license. In 2013 the project was donated to the Apache Software Foundation and relicensed under Apache 2.0.
Spark Stack: Spark Core; Spark SQL; Spark Streaming; MLlib; GraphX. Cluster managers: Standalone Scheduler, YARN, Mesos.
MapReduce Revisited: Hadoop is based on acyclic data flow from stable storage to stable storage. [Diagram: data flows from stable storage through mappers and reducers back to stable storage.]
Linear Regression Using MapReduce: the model is $Y = Xw$, where $X$ is the $n \times d$ data matrix, $Y$ is the $n \times 1$ target vector, and $w$ is the $d \times 1$ weight vector.
Linear Regression Using MapReduce: the closed-form solution is $w = (X^T X)^{-1}(X^T Y) = \left(\sum_{i=1}^{n} x_i x_i^T\right)^{-1}\left(\sum_{i=1}^{n} x_i y_i\right)$. The two sums are computed by two MapReduce jobs, $m_1$ and $m_2$.
Linear Regression Using MapReduce. Job 1 computes $\sum_{i=1}^{n} x_i x_i^T$:
Map(record) { x = getXFromRecord(record); emit("xx", x * transpose(x)); }   // Mapper 1
Reduce(key, values) { emit(key, sum(values)); }                             // Reducer 1
Linear Regression Using MapReduce. Job 2 computes $\sum_{i=1}^{n} x_i y_i$:
Map(record) { x = getXFromRecord(record); y = getYFromRecord(record); emit("xy", x * y); }   // Mapper 2
Reduce(key, values) { emit(key, sum(values)); }                                              // Reducer 2
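For illustration, a minimal local sketch (plain Python, not a real MapReduce job) of the two map/reduce steps above; the record format and the sample data are made up for this example:

import numpy as np
from functools import reduce

# Hypothetical dataset: n records of (feature vector x, target y).
records = [(np.array([1.0, 2.0]), 3.0),
           (np.array([2.0, 0.5]), 1.5),
           (np.array([0.5, 1.0]), 2.0)]

# Job 1: map each record to x * x^T, reduce by summing -> X^T X.
xtx = reduce(lambda a, b: a + b, map(lambda r: np.outer(r[0], r[0]), records))

# Job 2: map each record to x * y, reduce by summing -> X^T Y.
xty = reduce(lambda a, b: a + b, map(lambda r: r[0] * r[1], records))

# Combine the two results to solve for the weights w.
w = np.linalg.solve(xtx, xty)
print(w)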
Iterative Procedure in Hadoop: each iteration is a chain of MapReduce jobs (MR1, MR2, MR3). Every iteration reads its input from HDFS and writes its output back to HDFS, so the data goes through disk between iterations. [Diagram: Iteration 1 and Iteration 2, each performing an HDFS read from data on disk and an HDFS write back to disk.]
Interactive Procedure in Hadoop: each query (Query 1, Query 2, Query 3) performs its own HDFS read of the data on disk before producing its result, so the same data is re-read for every query.
Solution: Keep Working Set in Memory (Iterative Algorithms). [Diagram: the first iteration reads from HDFS; intermediate results stay in distributed memory between iterations, and only the final output is written back to disk.]
Solution: Keep Working Set in Memory (Interactive Procedures). [Diagram: the data is loaded from disk into distributed memory once; Query 1, Query 2, and Query 3 are then answered from memory.]
Challenge How to design a distributed memory abstraction that is both efficient and fault tolerant?
Resilient Distributed Datasets (RDDs): a distributed collection of objects. In Spark, all work is done using RDDs. Spark Core automatically distributes the data contained in RDDs across the cluster, and the operations performed on RDDs are parallelized across the partitions.
Properties of RDDs: in-memory, immutable, lazily evaluated, parallelized, partitioned. RDDs can hold any type of Scala, Java, or Python objects, including user-defined ones.
Creating RDDs. From a file: rdd = sc.textFile("file.txt"). Using parallelize(): rdd = sc.parallelize([1, 2, 3])
Types of Operations. Transformations: lazy operations that return another RDD. Actions: operations that trigger computation and return values. A minimal sketch of the difference follows.
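A minimal sketch, assuming a running SparkContext named sc:

rdd = sc.parallelize([3, 4, 1, 3])

# Transformation: nothing is computed yet, we only get a new RDD.
doubled = rdd.map(lambda x: x * 2)

# Actions: trigger the actual computation and return values to the driver.
print(doubled.collect())   # [6, 8, 2, 6]
print(doubled.count())     # 4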
Transformations (applied to an RDD with the contents {3, 4, 1, 3}):
map: returns a new RDD formed by passing each element of the source through a function. rdd.map(lambda x: x + 1) → {4, 5, 2, 4}
filter: returns a new RDD formed by selecting those elements of the source on which the function returns true. rdd.filter(lambda x: x % 3 != 0) → {4, 1}
flatMap: applies a function to each element in the RDD and returns an RDD of the contents of the iterators returned. rdd.flatMap(lambda x: range(x)) → {0, 1, 2, 0, 1, 2, 3, 0, 0, 1, 2}
distinct: returns a new RDD that contains the distinct elements of the source dataset. rdd.distinct() → {3, 4, 1}
Transformations (applied to an RDD with the contents {("a", 2), ("b", 1), ("a", 1), ("c", 2), ("b", 3), ("a", 2), ("c", 4)}):
groupByKey: when called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. rdd.groupByKey() → {("a", {2, 1, 2}), ("b", {1, 3}), ("c", {2, 4})}
reduceByKey: when called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduction function. rdd.reduceByKey(lambda a, b: a + b) → {("a", 5), ("b", 4), ("c", 6)}
Transformations (applied to two RDDs, first = {1, 2, 3} and second = {2, 4, 5}):
union: returns a new RDD that contains the union of the elements in the source dataset and the argument. first.union(second) → {1, 2, 3, 2, 4, 5}
intersection: returns a new RDD that contains the intersection of the elements in the source dataset and the argument. first.intersection(second) → {2}
subtract: returns a new RDD that contains the elements of the source dataset that are not in the argument. first.subtract(second) → {1, 3}
cartesian: when called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements). first.cartesian(second) → {(1, 2), (1, 4), (1, 5), (2, 2), (2, 4), (2, 5), (3, 2), (3, 4), (3, 5)}
Actions (applied to an RDD with the contents {3, 4, 1, 3}):
reduce: aggregates the elements of the RDD using a function. rdd.reduce(lambda a, b: a * b) → 36
collect: returns all the elements of the RDD as an array at the driver program. rdd.collect() → [3, 4, 1, 3]
count: returns the number of elements in the RDD. rdd.count() → 4
first: returns the first element of the RDD. rdd.first() → 3
take: returns an array with the first n elements of the dataset. rdd.take(2) → [3, 4]
Question. There is a function defined as:
def operation(rdd):
    result = rdd.map(lambda x: (x, 1.0)).reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    return result
If the input to the function operation is the RDD {1, 2, 1}, what is the value returned by it?
a. 4, 1.0   b. 3, 3.0   c. 4, 3.0   d. 3, 1.0
Example: Word Count
def count_words(some_file):
    rdd = sc.textFile(some_file)
    return (rdd.flatMap(lambda x: x.split())
               .map(lambda x: (x, 1))
               .reduceByKey(lambda a, b: a + b))
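One possible way to use it, assuming a running SparkContext sc and a hypothetical input file "book.txt":

counts = count_words("book.txt")      # "book.txt" is just a placeholder path
for word, count in counts.take(10):   # take() avoids collecting a huge result at the driver
    print(word, count)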
Lazy Evaluation. Computation and loading of data happen only when an action is called; until then, Spark internally records metadata to indicate that the operation has been requested. A quick example, finding the first line containing the word "Python":
lines = sc.textFile("UsePython.txt")
pythonLines = lines.filter(lambda line: "Python" in line)
pythonLines.first()  # this is where Spark loads the data and carries out the operations
Question. Given the flow below:
1. data = sc.textFile('text.txt')
2. lens = data.map(lambda x: len(x))
3. addup = lens.reduce(lambda a, b: a + b)
4. addup.collect()
At which point in time is the RDD 'data' loaded in memory? 1, 2, 3, or 4?
Lineage Graph. [Diagram: inputRdd is created from a file; map produces lengthsRdd, filter produces subsetRdd, map produces squaresRdd, reduceByKey produces pairsRdd, and distinct produces distinctRdd.] Spark records this graph of transformations for every RDD.
Lineage Graph. [Diagram: the same graph, with one RDD's partition corrupted.] If a partition is lost or corrupted, Spark uses the lineage graph to recompute just the missing data from its parent RDDs.
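A minimal sketch of building such a chain and inspecting its lineage, assuming a running SparkContext sc and a hypothetical input file "file.txt" (the exact chain in the diagram may differ):

inputRdd = sc.textFile("file.txt")                 # hypothetical input file
lengthsRdd = inputRdd.map(lambda line: len(line))  # map
subsetRdd = lengthsRdd.filter(lambda n: n > 0)     # filter
squaresRdd = subsetRdd.map(lambda n: n * n)        # map

# toDebugString() prints the lineage Spark has recorded for this RDD.
print(squaresRdd.toDebugString())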
Components for Distributed Execution: the driver program creates a SparkContext; each worker node runs an executor, and the executors run the tasks sent to them. [Diagram: Driver Program containing the SparkContext, connected to Worker Nodes, each with an Executor running Tasks.]
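A minimal sketch of a driver program creating its SparkContext; the application name and the local master URL are arbitrary choices for illustration:

from pyspark import SparkConf, SparkContext

# The driver program configures and creates the SparkContext.
conf = SparkConf().setAppName("example-app").setMaster("local[2]")  # 2 local worker threads
sc = SparkContext(conf=conf)

# Work expressed on RDDs is split into tasks and shipped to the executors.
print(sc.parallelize(range(100)).sum())

sc.stop()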
Caching. We might want to use the same RDD multiple times, but by default Spark will re-compute the RDD and all of its dependencies each time we call an action on it. For example:
inputRdd = sc.textFile("file.txt")
lengths = inputRdd.map(lambda x: len(x))
squares = lengths.map(lambda x: x * x)
squares.reduce(lambda a, b: a + b)
squares.count()
This is expensive for iterative algorithms. If we wish to use an RDD multiple times, we can cache it.
Caching. [Diagram: the driver ships tasks to the workers; each worker stores its blocks of the cached RDD in its local cache.]
inputRdd = sc.textFile("file.txt")
lengths = inputRdd.map(lambda x: len(x))
squares = lengths.map(lambda x: x * x)
squares.cache()
squares.reduce(lambda a, b: a + b)  # computes and caches the partitions
squares.count()                     # reuses the cached partitions
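Beyond cache(), which keeps an RDD in memory only, persist() lets you choose a storage level explicitly; a brief sketch, reusing the squares RDD from above:

from pyspark import StorageLevel

# cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
squares.persist(StorageLevel.MEMORY_AND_DISK)  # spill partitions to disk if memory is short

squares.count()      # the first action materializes and persists the partitions
squares.unpersist()  # free the cached partitions when no longer needed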
Broadcast Variables. Broadcast variables allow the program to efficiently send a large, read-only value to all the worker nodes. Without them, the value is shipped with every task:
baseRdd = sc.textFile("text.txt")
def lookup(value, table):
    return table[value]
lookupTable = loadLookupTable()
baseRdd.map(lambda x: lookup(x, lookupTable)) \
       .reduce(lambda a, b: a + b)
Broadcast Variables. With a broadcast variable, the lookup table is shipped to each worker only once:
baseRdd = sc.textFile("text.txt")
def lookup(value, table):
    return table[value]
lookupTable = sc.broadcast(loadLookupTable())
baseRdd.map(lambda x: lookup(x, lookupTable.value)) \
       .reduce(lambda a, b: a + b)
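A small self-contained sketch of the same pattern, with a made-up lookup table in place of loadLookupTable():

# Hypothetical lookup table, broadcast once to every worker node.
table = sc.broadcast({"a": 1, "b": 2, "c": 3})

words = sc.parallelize(["a", "b", "a", "c"])
total = words.map(lambda w: table.value[w]).reduce(lambda a, b: a + b)
print(total)  # 7

table.unpersist()  # release the broadcast copies held by the executors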
Example: Logistic Regression. Gradient descent update: $w_j := w_j - \alpha \sum_{i=1}^{n} \left( h_w(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$, where $h_w(x) = \frac{1}{1 + e^{-w^T x}}$.
val data = spark.textFile(...).map(readPoint).cache()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-(w dot p.x))) - p.y) * p.x
  ).reduce((a, b) => a + b)
  w -= ALPHA * gradient
}
println("w: " + w)
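For comparison with the deck's Python examples, a rough PySpark sketch of the same loop; the input file name, the line format (label followed by D features), and the constants are assumptions made for this illustration:

import numpy as np

D = 10            # number of features (assumed)
ITERATIONS = 20
ALPHA = 0.1

def read_point(line):
    # Assumed line format: label x1 x2 ... xD
    parts = [float(v) for v in line.split()]
    return np.array(parts[1:]), parts[0]

data = sc.textFile("points.txt").map(read_point).cache()  # hypothetical input file
w = np.random.rand(D)

for _ in range(ITERATIONS):
    gradient = data.map(
        lambda p: (1.0 / (1.0 + np.exp(-w.dot(p[0]))) - p[1]) * p[0]
    ).reduce(lambda a, b: a + b)
    w -= ALPHA * gradient

print("w:", w)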
Logistic Regression Performance. Hadoop: 127 s per iteration. Spark: 174 s for the first iteration (loading and caching the data), then about 6 s for each further iteration.
THANK YOU