SparkR: Enabling Interactive Data Science at Scale

Presentation transcript:

SparkR: Enabling Interactive Data Science at Scale (Shivaram Venkataraman, Zongheng Yang) Hi everyone. My name is Zongheng, and I'm very excited to be here today to talk about SparkR, which is a research project Shivaram and I started at the UC Berkeley AMPLab a couple of months ago. A little bit about ourselves: I am a third-year undergraduate at Berkeley studying computer science and math, and I work in the AMPLab as a research assistant. Shivaram is a third-year CS PhD student who is also part of the UC Berkeley AMPLab; his research interests include topics in distributed computing and machine learning. OK, just so I can get a rough impression: I'm guessing most of us here have some experience with Spark, but how many of us have used or programmed in R before? OK, let's jump right into the talk.

Talk Outline: Motivation; Overview of Spark & SparkR API; Live Demo: Digit Classification; Design & Implementation; Questions & Answers. So much for the intro. Here's the outline of the talk.

Motivation: Fast! Scalable. Flexible. Before introducing what SparkR is and what it can do, let me explain a little bit of the motivation behind the project. When users choose Spark, or when we think about Spark's advantages, usually the first thing that comes to mind is that Spark is very fast. Furthermore, being a cluster computing engine, Spark is also scalable. Another advantage is flexibility: you can use the highly expressive APIs to write concise programs, and you have the choice of writing them in different languages. We have seen how these characteristics and other features have made Spark popular.

Statistical! Packages. Interactive. Now what about R? First, R is amazingly good at statistics and data analysis; in fact, it is designed for statisticians and is very popular in related fields. Moreover, there is an extensive list of mature, widely used packages available in R. Examples include the ggplot2 package for plotting sophisticated graphs (with immense control over the layout) and the plyr package for manipulating and transforming data. Additionally, R fits well with an interactive workflow: for instance, you can load your datasets into the R shell, do some exploration, and quickly visualize your findings by plotting various graphs.

Fast! Statistical! Scalable. Packages. Flexible. Interactive. However, there's one drawback: traditionally, the R runtime is single-threaded, and it is unclear how R programs can be effectively and concisely written to run on multiple machines. So, what if we could combine these two worlds? This is where SparkR comes in: it is a language binding that lets users write R programs that are equipped with R's statistics packages and have them run on top of Spark.

RDD: Transformations (map, filter, groupBy, …) and Actions (count, collect, saveAsTextFile, …) on a parallel collection. So when we started thinking about what kind of API SparkR should have, we looked at Spark's API first. Most of the Spark API consists of operations on the RDD class. As some of you might know, an RDD has two kinds of operations: transformations and actions. For instance, the map function is a transformation that applies a custom function to every element in the RDD. Actions are the operations that actually fire off computation, so when you call saveAsTextFile(), you expect the call to immediately start saving the elements of the RDD to a text file; the same goes for count() and collect(). When designing SparkR's API, one direct approach is to mimic these API functions but let users call them inside R programs and on R datasets.
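To make the transformation/action distinction concrete in R, here is a minimal sketch in the style of the SparkR API described on the next slides; it assumes a SparkR context sc already exists, and the exact signatures (for example of parallelize) may differ in the alpha release:

  nums    <- parallelize(sc, as.list(1:1000))        # distribute a local R collection as an RDD
  squares <- lapply(nums, function(x) x * x)          # transformation: lazily recorded, nothing runs yet
  total   <- collect(squares)                         # action: triggers the computation, returns a local R list
  length(total)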

R + RDD = R2D2 Let me present the result of adding R and RDD: for the Star Wars fans out there, the correct answer is obviously R2D2 the robot!

R + RDD = RRDD: lapply, lapplyPartition, groupByKey, reduceByKey, sampleRDD, collect, cache, …, broadcast, includePackage, textFile, parallelize. Actually, we came up with something cooler, which is RRDD. RRDD is a subclass of RDD that facilitates calling the familiar RDD functions from inside R. Furthermore, we provide aliases for some of the functions, which respects idiomatic practice in R. For instance, the RDD map() function is still available in SparkR, but we also provide an alias called lapply(). This is because a native lapply() call in R simply loops over a list of data, and an RRDD is conceptually a list of elements, so we chose this name. Besides supporting many of the essential RDD operations, such as the transformations listed here and Spark's broadcast support, SparkR also includes some new features that attempt to fulfill our design motivation. For instance, we have introduced an includePackage() function that takes a package name and marks it as loaded in the environment of every worker node running SparkR. Using this function, users can call functions from existing R packages, or their own UDFs, inside the closures of, say, lapply() or groupByKey().
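As a rough illustration of how includePackage() might be combined with lapply() (a sketch only: it assumes the stringr package is installed on the workers, and the exact includePackage signature in the alpha release may differ):

  includePackage(sc, "stringr")                 # make the package available on the workers
  lines   <- textFile(sc, "hdfs://my_text_file")
  trimmed <- lapply(lines, function(line) {
    str_trim(line)                              # str_trim comes from stringr, loaded on each worker
  })
  collect(trimmed)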

Getting Closer to Idiomatic R. Q: How can I use a loop to [...insert task here...]? A: Don't. Use one of the apply functions. We also considered idiomatic practice in R. For instance, it is very common in, say, Java to loop over a collection of data and perform some operation on each element. R, however, has a very nice family of functions called apply, such as lapply or sapply, that do the same thing. So instead of explicitly writing a for loop or a while loop, idiomatic R prefers these apply functions (the Q&A above is a quote from the article linked below). The ultimate purpose of this design decision is to flatten the learning curve for R programmers as much as possible. From: http://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/
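For example, in plain local R the same computation can be written with an explicit loop or, more idiomatically, with one of the apply functions (sapply here); SparkR's lapply over an RDD mirrors the second form:

  words <- c("spark", "r", "amplab")

  # Explicit loop (works, but not idiomatic R)
  lengths <- integer(length(words))
  for (i in seq_along(words)) {
    lengths[i] <- nchar(words[i])
  }

  # Idiomatic R: apply a function over the collection
  lengths <- sapply(words, nchar)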

Example: Word Count. lines <- textFile(sc, "hdfs://my_text_file") So much for the overview of SparkR's current API. Let's see how it works in action with an example. Of course, we are now tackling the most important problem in distributed computing: Word Count. I am going to walk through this very short R program line by line and explain how it uses SparkR's API as we go. (On the slides, the things marked in red are SparkR or Spark concepts.) The first step of the word count program is to read data from HDFS into an RDD. For this we use the textFile function, whose first argument is a Spark context sc and whose second is a path to an HDFS location. By the way, if you start SparkR in the native R shell, we automatically create this variable sc for you, just like the Spark shell does. After this call, lines is conceptually an RDD of strings, ready to be further operated on inside R. For those of you who are already familiar with Spark, you will probably notice that this is very similar to its counterpart in the original API.

Example: Word Count.
lines <- textFile(sc, "hdfs://my_text_file")
words <- flatMap(lines, function(line) {
  strsplit(line, " ")[[1]]
})
wordCount <- lapply(words, function(word) {
  list(word, 1L)
})
- serialize closures
Next step, we extract the actual words from each line. To do this, we use the Spark flatMap function and feed the lines RDD and a closure into it. The closure uses the R function strsplit, which splits each line on spaces; the [[1]] pulls the resulting character vector of words out of the list that strsplit returns. The third step uses the SparkR lapply() function, an alias for map(), which maps over this words RDD and for each word produces a (word, 1) pair. So how does this actually get executed? Under the hood, SparkR automatically fetches the dependencies of each closure and serializes the whole thing for you. It is then shipped over the network to the workers along with the tasks.

Example: Word Count.
lines <- textFile(sc, "hdfs://my_text_file")
words <- flatMap(lines, function(line) {
  strsplit(line, " ")[[1]]
})
wordCount <- lapply(words, function(word) {
  list(word, 1L)
})
counts <- reduceByKey(wordCount, "+", 2L)
output <- collect(counts)
- Use R primitive functions like "+"
To finish the word count program, we call the Spark reduceByKey function, which takes the previous key-value pair RDD, an R primitive function (the "+" here), and the number of partitions to use. Lastly, we call collect() on this counts RDD, getting back the final answer as a local R object.
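As a small follow-up in plain local R (not part of the slides; it assumes the collected output has at least ten entries), the result is an ordinary R list of (word, count) pairs and can be inspected with the usual tools:

  # Sort the collected (word, count) pairs by count and look at the top ten.
  counts_vec <- sapply(output, function(pair) pair[[2]])
  top10 <- output[order(counts_vec, decreasing = TRUE)][1:10]
  top10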

Live Demo. As the next part of the talk, I want to do a live demo that presents some of the advantages of SparkR. Specifically, we will tackle a machine learning problem in R, but make our program run faster by executing it on Spark.

MNIST. The machine learning problem is digit recognition using the MNIST dataset. MNIST is very widely used and studied in the machine learning community; it is basically a set of images of hand-written digits. The problem is to train a machine learning model that recognizes the actual digit in each image.

Minimize ||Ax - b||. Here's one formalization of this problem. Basically, our high-level plan is to extract a feature matrix A from the input dataset, as well as a label vector b. The goal is to find the vector x such that the norm ||Ax - b|| is minimized. The program will therefore compute A^T A and A^T b, and then solve the normal equations (A^T A) x = A^T b for x. [BEGIN DEMO]
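For reference, here is a minimal local (non-distributed) R sketch of that least-squares step, using toy data in place of MNIST; in the demo the A^T A and A^T b pieces would be accumulated across RDD partitions, but the final solve looks the same:

  # Toy data standing in for the MNIST features and labels.
  set.seed(42)
  A <- matrix(rnorm(100 * 10), nrow = 100)   # feature matrix (100 examples, 10 features)
  b <- rnorm(100)                            # label vector

  AtA <- t(A) %*% A
  Atb <- t(A) %*% b
  x   <- solve(AtA, Atb)                     # solve (A^T A) x = A^T b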

How does this work? [END DEMO] Hopefully you agree that this all seems pretty cool. For the next part of the talk, I'd like to discuss some of the details of SparkR's design and implementation, namely how we went about implementing all of this functionality under the hood.

Dataflow (diagram: a local driver machine and two worker machines). The core internals of SparkR can be explained by illustrating the dataflow of a computation in a SparkR job. Let's consider a simple scenario where you launch SparkR from a local machine and the cluster contains two workers.

Dataflow (diagram: an R process on the local machine, two workers). The first thing you do is launch a normal R process, such as the R shell.

Dataflow (diagram: the local R process uses JNI to create a JavaSparkContext). The next step is to launch SparkR by calling library(SparkR) from within that R process. What this does is use JNI to create a JavaSparkContext and hold on to a reference to it inside R. This reference is what the SparkR shell exposes as sc, as we have seen before.
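From the R side, starting a session might look roughly like this (a sketch; sparkR.init is how the alpha SparkR package exposed context creation, but treat the exact call and arguments as an assumption):

  library(SparkR)                       # load the SparkR package
  sc <- sparkR.init(master = "local")   # creates a JavaSparkContext via JNI and returns a handle
  rdd <- parallelize(sc, as.list(1:10))
  length(collect(rdd))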

Dataflow (diagram: each worker's Spark executor execs an R worker process). When an action actually takes place, the JavaSparkContext in the JVM instructs the worker nodes to launch Spark executors, again inside their own JVMs. Each Spark executor then forks off a new R worker process. This R worker takes care of deserializing and loading the task, the list of R packages to include locally, broadcast variables, and so forth. The actual computation happens in this R worker process, and the results are communicated back to the Spark executor, which in turn communicates them back to the driver machine.

One thing of particular interest here is the pipes we use for communication between the R worker processes and the executors running in the JVM. There are two parts to this communication: one is how we grab the dependencies of a closure (in other words, an anonymous function), and the other is how we serialize and deserialize these functions and their dependencies.

(Environment diagram from http://obeautifulcode.com/R/How-R-Searches-And-Finds-Stuff/) The way we attack the first aspect is, of course, by traversing the old, lovely environment diagrams. Basically, in R, environment objects store the mappings between variable names and their values, and different environments are chained together by a parent relationship, as defined by R's lexical scoping semantics. So if a closure uses a variable that is not defined in its own environment, we keep walking up this environment chain and grab the first value found.
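Here is a small plain-R sketch of that idea (illustrative only, not SparkR's actual dependency-capture code): a closure's free variable is resolved by walking up the chain of parent environments.

  multiplier <- 3
  times <- function(x) x * multiplier   # `multiplier` is a free variable of this closure

  # Walk up the environment chain from the closure's environment until the name is found,
  # mimicking how R (and a dependency-capturing serializer) would resolve it.
  find_binding <- function(name, env) {
    while (!identical(env, emptyenv())) {
      if (exists(name, envir = env, inherits = FALSE)) return(get(name, envir = env))
      env <- parent.env(env)
    }
    stop("not found")
  }
  find_binding("multiplier", environment(times))   # returns 3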

The second aspect of the communication issue is the way we communicate data. Basically, we use the pretty robust native serialization facility provided in R, the save() function. It is handy and mature; in fact, each time you finish playing around with an R shell and try to exit, you get a prompt asking whether to save the session, and this save() function does all the hard work of serializing all kinds of R objects to disk. In SparkR's case, we feed the objects we want to serialize into this facility, get back a byte array as the result, and ship that byte array across the network.
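For illustration, the same round trip can be seen with R's built-in serialize()/unserialize(), which produce and consume a raw byte vector directly (the talk mentions save(), which writes to a file or connection; this is a sketch of the general mechanism rather than SparkR's exact code path):

  obj   <- list(word = "spark", count = 1L, f = function(x) x + 1)
  bytes <- serialize(obj, connection = NULL)   # a raw vector, ready to ship over a pipe or socket
  length(bytes)

  restored <- unserialize(bytes)
  restored$f(41)                               # 42: functions round-trip too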

Dataflow: Performance? (same diagram as before). Hopefully this explains the seemingly complicated communication process highlighted here. A natural follow-up question is: what if there are multiple transformations on an RDD? In Spark, transformations don't fire off computation jobs because they have lazy semantics; when an action takes place, all previously uncomputed transformation functions are combined into one function and shipped over the network. In SparkR we want to follow the same semantics, partly to stay similar to Spark's other APIs and partly to avoid paying the cost of the serialization process described above every time a transformation is called on an RDD.

…Pipeline the transformations! words <- flatMap(lines, …); wordCount <- lapply(words, …) SparkR's solution is to introduce a pipelined RDD that does exactly this optimization. It combines all the R functions in a series of transformations together and ships them at once, so that Spark executors do not need to fork off a new R process for every transformation. This is one optimization that we currently do.
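Conceptually (an illustrative plain-R sketch, not SparkR's internal pipelining code), pipelining the two chained transformations from the word count example amounts to composing their closures into a single per-line function that the R worker applies in one pass:

  split_words <- function(line) strsplit(line, " ")[[1]]   # the flatMap closure
  to_pair     <- function(word) list(word, 1L)             # the lapply closure

  # The pipelined function the worker would run once per line of input.
  pipelined <- function(line) lapply(split_words(line), to_pair)

  pipelined("to be or not to be")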

Alpha developer release. One-line install (using install_github from the devtools package): install_github("amplab-extras/SparkR-pkg", subdir="pkg")

SparkR Implementation: lightweight. 292 lines of Scala code, 1694 lines of R code, 549 lines of test code in R. …Spark is easy to extend!

In the Roadmap: calling MLlib directly within SparkR; data frame support; better integration with R packages; performance: daemon R processes. Speaking of extensions…

On GitHub: EC2 setup scripts, all Spark examples, the MNIST demo, Hadoop2 and Maven build.

SparkR: combine scalability & utility. RDD :: distributed lists. Closures & serialization. Re-use R packages.

Thanks! https://github.com/amplab-extras/SparkR-pkg Shivaram Venkataraman shivaram@cs.berkeley.edu Zongheng Yang zongheng.y@gmail.com Spark User mailing list user@spark.apache.org

Pipelined RDD (diagram contrasting Spark executors that exec a separate R process for each transformation with pipelined executors that exec a single R process for the combined transformations). Here's an illustration of the effect.

(Stack diagram: SparkR on top of the Spark processing engine, a cluster manager such as Mesos / YARN, and a storage layer such as HDFS / HBase / Cassandra.) Let's take a high-level look at SparkR. This is a diagram of a common Spark stack. At the top you have the processing engine, which is Spark. Optionally, you can have a third-party cluster manager, such as Mesos or YARN, which manages tasks across all the workers in a cluster. At the bottom is the storage layer; Spark supports reading from popular data formats and data sources such as HDFS, HBase, and Cassandra. So where does SparkR fit into this picture? Right on top: SparkR lets users write R programs and provides an interface into Spark. In other words, your familiar R programs can be run on Spark, utilizing the power of cluster computing while retaining the aforementioned benefits of R.

Example: Logistic Regression.
pointsRDD <- textFile(sc, "hdfs://myfile")
weights <- runif(n=D, min = -1, max = 1)
# Logistic gradient
gradient <- function(partition) {
  X <- partition[, -1]; Y <- partition[, 1]
  t(X) %*% ((1 / (1 + exp(-Y * (X %*% weights))) - 1) * Y)
}

Example: Logistic Regression.
pointsRDD <- textFile(sc, "hdfs://myfile")
weights <- runif(n=D, min = -1, max = 1)
# Logistic gradient
gradient <- function(partition) {
  X <- partition[, -1]; Y <- partition[, 1]
  t(X) %*% ((1 / (1 + exp(-Y * (X %*% weights))) - 1) * Y)
}
# Iterate
weights <- weights - reduce(
  lapplyPartition(pointsRDD, gradient), "+")
Write jobs in R. Use the R shell. Support R packages.
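As a usage note (a sketch only: the number of iterations is an assumption, and the update step is exactly the one shown on the slide), the gradient update would normally be repeated for several passes over the data:

  iterations <- 10
  for (i in 1:iterations) {
    # One pass over the data: sum the per-partition gradients and take a step.
    weights <- weights - reduce(lapplyPartition(pointsRDD, gradient), "+")
  }
  weights   # the fitted coefficient vector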

How does it work? (Backup diagram: the R shell talks to the Spark context through rJava; data is stored as RDD[Array[Byte]] and functions are shipped as Array[Byte] to RScript processes running alongside the Spark executors.)