Spark with R
Martijn Tennekes
THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION
Spark and R: best of both worlds
Apache Spark
  - Cluster-computing framework
  - Supports SQL syntax
  - Built-in machine learning
  - High performance
R
  - Many tools and packages for data analysis
  - Interactive documents
  - Powerful visualizations
  - Interactive dashboards
Using Spark in R
RStudio
  - Spark objects are shown in the Connections window, in the same way that local R objects are shown in the Environment window
  - Button to access the Spark Web Console
R, package sparklyr
  - A dplyr back-end for Spark
  - Extendable, e.g. rsparkling for H2O (machine learning using Spark)
  - Easy to configure a Spark cluster, or to set up a local Spark cluster
Data science workflow
[Figure: data science workflow diagram]
Source: R for Data Science - Garrett Grolemund, Hadley Wickham
Import data
Connect to Spark
    sc <- spark_connect(master = "local")  # or the master URL of a cluster
Import data
  - From csv file
        x_tbl <- spark_read_csv(sc, "x.csv")
  - From json
        x_tbl <- spark_read_json(sc, "x.json")
  - From parquet (HDFS)
        x_tbl <- spark_read_parquet(sc, "name", path)
  - From R object
        x_tbl <- copy_to(sc, x)
The returned object (x_tbl) is a tbl_spark object, which can be used like a normal tbl (data.frame).
Retrieving Spark tables
  - Get table names
        src_tbls(sc)
  - Get tbl_spark object
        x_tbl <- tbl(sc, "table")
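The import steps above can be tried without any external files. The following is a minimal sketch, assuming a local Spark installation is available (for instance one installed with spark_install()); the table name "mtcars_spark" is chosen for illustration only.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally; replace "local" with a cluster URL if needed.
sc <- spark_connect(master = "local")

# Copy a built-in R data.frame to Spark. The table name is illustrative.
mtcars_tbl <- copy_to(sc, mtcars, name = "mtcars_spark", overwrite = TRUE)

# List the tables registered on this connection.
table_names <- src_tbls(sc)

# Get a reference to an existing Spark table by name.
same_tbl <- tbl(sc, "mtcars_spark")

# Both objects are references to data held in Spark, not local copies.
is_spark_tbl <- inherits(mtcars_tbl, "tbl_spark")

spark_disconnect(sc)
```

Only the reference lives in R; the data itself stays in Spark until it is explicitly collected.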
Process data
Process data with dplyr functions, e.g.
  - filter to filter rows
  - select to select columns
  - mutate to create new variables
  - group_by to specify the groups
  - summarize to summarize the values
  - left_join to join tables
Process data
Process data with the main dplyr functions:

  dplyr       SQL                   Description
  select      SELECT                Select columns
  filter      WHERE                 Filter rows
  arrange     ORDER BY              Order rows
  mutate      +, *, LOG, etc.       Create new columns
  group_by    GROUP BY              Group rows
  summarise   AVG, SUM, MIN, etc.   Aggregate columns

Recommendation: use the pipe operator %>%
  Without the pipe:
    tmp <- functionA(x, paramA1, paramA2)
    y <- functionB(tmp, paramB)
  With the pipe:
    y <- x %>%
      functionA(paramA1, paramA2) %>%
      functionB(paramB)
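As an illustration of the dplyr functions and the pipe operator, here is a small sketch on the built-in mtcars data.frame. It runs on a local data.frame so that it needs no Spark connection; the same chain of verbs can be applied to a tbl_spark object, in which case sparklyr translates it to Spark SQL. The column name kpl and the conversion factor are choices made for this example.

```r
library(dplyr)

# Mean fuel economy per number of cylinders, for cars with more than 100 hp.
result <- mtcars %>%
  filter(hp > 100) %>%                        # WHERE
  mutate(kpl = mpg * 0.425) %>%               # miles per gallon -> km per litre (approx.)
  group_by(cyl) %>%                           # GROUP BY
  summarise(mean_kpl = mean(kpl), n = n()) %>% # AVG, COUNT
  arrange(desc(mean_kpl))                     # ORDER BY

print(result)
```

Each step receives the result of the previous one as its first argument, so no temporary variables are needed.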
Visualize data
Approach:
  - Process the data such that the output is exactly what is needed for the plot
  - Use collect() to copy this data into R's memory
  - Use general-purpose R packages to create the plot:
      - ggplot2 for any type of plot (line chart, bar chart, scatter plot, heatmap, etc.)
      - tmap for (interactive) maps
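The three steps of this approach might look as follows. This is a sketch under the assumption that a local Spark installation is available; the table name and the choice of a bar chart are illustrative.

```r
library(sparklyr)
library(dplyr)
library(ggplot2)

# Assumption: Spark is installed locally.
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, name = "mtcars_spark", overwrite = TRUE)

# 1. Aggregate in Spark, so that only the values needed for the plot remain.
# 2. collect() copies this small result into R's memory.
plot_data <- mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

spark_disconnect(sc)

# 3. Plot the collected data with a general-purpose package.
p <- ggplot(plot_data, aes(x = factor(cyl), y = mean_mpg)) +
  geom_col() +
  labs(x = "Cylinders", y = "Mean miles per gallon")
print(p)
```

Collecting after the aggregation keeps the amount of data transferred to R small, which matters when the Spark table is large.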
Model data
Any R package
  - Many implemented methods (over 14000 CRAN packages)
  - Downside: data needs to be collected into R's memory (see previous slide)
H2O
  - Open-source software for big data analysis (statistical methods / machine learning)
  - Integration with Spark via Sparkling Water
  - Sparkling Water is accessible with the R package rsparkling
Spark MLlib
  - Native machine learning algorithms
  - All functions have the prefix ml_, for instance ml_random_forest
  - Functions tidy, augment, and glance from the broom package are implemented for these methods. These functions summarize the method output at different levels. The format is always a data.frame.
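A minimal sketch of the Spark MLlib route, assuming a local Spark installation. It uses ml_linear_regression rather than the ml_random_forest mentioned above, simply because its output is easy to read; the model formula is chosen for illustration.

```r
library(sparklyr)
library(dplyr)
library(broom)

# Assumption: Spark is installed locally.
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, name = "mtcars_spark", overwrite = TRUE)

# Fit the model in Spark; the data are not collected into R.
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + hp)

# Summaries at different levels:
coefs <- tidy(fit)       # coefficient level
stats <- glance(fit)     # model level
fitted <- augment(fit)   # observation level (predictions added to the data)

print(coefs)
print(stats)

spark_disconnect(sc)
```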
Communicate results
R Markdown
  - Markdown is a lightweight and easy-to-use markup language
  - With R Markdown, chunks of R code can be embedded
  - Each chunk can either be shown or hidden in the document. The output of the chunk can also be shown or hidden.
  - Interactive visualizations can be included as well
Shiny
  - Interactive dashboards
  - Can be run locally, or hosted on a server
  - Free servers available, e.g. shinyapps.io
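To give an impression of what a Shiny dashboard involves, here is a small sketch. The summary table is computed locally from mtcars so that the example is self-contained; in a Spark workflow it would typically be a small result obtained with collect(). The input and output names are hypothetical.

```r
library(shiny)

# Illustrative summary table (stands in for a result collected from Spark).
summary_data <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# User interface: a selector and a text output.
ui <- fluidPage(
  titlePanel("Mean mpg by number of cylinders"),
  selectInput("cyl", "Cylinders", choices = summary_data$cyl),
  textOutput("mean_mpg")
)

# Server logic: look up the value for the selected number of cylinders.
server <- function(input, output, session) {
  output$mean_mpg <- renderText({
    row <- summary_data[summary_data$cyl == as.numeric(input$cyl), ]
    paste("Mean mpg:", round(row$mpg, 1))
  })
}

app <- shinyApp(ui, server)

# Launch the dashboard only in an interactive session.
if (interactive()) runApp(app)
```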
Overview of sparklyr functions
Function naming:
  - Feature transformation functions: prefix ft_
  - Machine learning functions (Spark MLlib): prefix ml_
  - Spark DataFrame functions*: prefix sdf_
  - Configuration / read / write functions: prefix spark_
  - Streaming data functions: prefix stream_
* Note that these functions can often be called without the prefix. For instance, calling copy_to on a Spark connection uses sdf_copy_to.
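The naming scheme can be seen in a short sketch that uses one function from most of the prefix groups (streaming is left out). It assumes a local Spark installation; the table name, the new column high_hp, and the threshold of 150 are chosen for illustration.

```r
library(sparklyr)
library(dplyr)

# spark_: configuration / connection (assumption: Spark is installed locally).
sc <- spark_connect(master = "local")

# sdf_: Spark DataFrame functions.
cars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_spark", overwrite = TRUE)
n_rows <- sdf_nrow(cars_tbl)

# ft_: feature transformation (flag cars with more than 150 hp).
flagged <- ft_binarizer(cars_tbl, input_col = "hp",
                        output_col = "high_hp", threshold = 150)

# ml_: machine learning with Spark MLlib.
model <- ml_linear_regression(flagged, mpg ~ wt + high_hp)

spark_disconnect(sc)
```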
References
  - R for Data Science - Garrett Grolemund and Hadley Wickham
    https://r4ds.had.co.nz/
  - sparklyr: R interface for Apache Spark - RStudio
    https://spark.rstudio.com/