Spark with R - Martijn Tennekes



Spark and R: best of both worlds

Apache Spark
- Cluster-computing framework
- Supports SQL syntax
- Built-in machine learning
- High performance

R
- Many tools and packages for data analysis
- Interactive documents
- Powerful visualizations
- Interactive dashboards

Using Spark in R

RStudio
- Spark objects are shown in the Connections window, in the same way as local R objects are shown in the Environment window.
- A button gives access to the Spark Web Console.

R package sparklyr
- A dplyr back end for Spark.
- Extendable, e.g. rsparkling for H2O (machine learning on Spark).
- Makes it easy to connect to a Spark cluster, or to set up a local Spark cluster.
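As a minimal getting-started sketch, assuming sparklyr is not yet installed and a local Spark instance is sufficient (on a real cluster the master URL would differ):

library(sparklyr)

install.packages("sparklyr")   # one-time installation of the R package
spark_install()                # one-time download of a local Spark distribution

# Connect to the local Spark instance; on a cluster, use its master URL instead,
# e.g. spark_connect(master = "yarn-client")
sc <- spark_connect(master = "local")

spark_web(sc)                  # opens the Spark Web Console in a browser
spark_disconnect(sc)           # close the connection when done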

Data science workflow (diagram: import, tidy, transform, visualize, model, communicate). Source: R for Data Science - Garrett Grolemund and Hadley Wickham

Import data

Connect to Spark
  sc <- spark_connect(master = "local")   # or the master URL of a cluster

Import data
- From a csv file:      x_tbl <- spark_read_csv(sc, "x.csv")
- From json:            x_tbl <- spark_read_json(sc, "x.json")
- From parquet (HDFS):  x_tbl <- spark_read_parquet(sc, "name", path)
- From an R object:     x_tbl <- copy_to(sc, x)

The returned object (x_tbl) is a tbl_spark object, equivalent to a normal tbl (data.frame).

Retrieving Spark tables
- Get table names:       src_tbls(sc)
- Get tbl_spark object:  x_tbl <- tbl(sc, "table")
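For instance, a hedged sketch that copies the built-in mtcars data frame into Spark and retrieves it again; the table name "mtcars" is chosen here only for illustration:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Copy a local R data.frame into Spark; the result is a tbl_spark
mtcars_tbl <- copy_to(sc, mtcars, name = "mtcars", overwrite = TRUE)

src_tbls(sc)                      # lists the registered Spark tables, e.g. "mtcars"
mtcars_tbl <- tbl(sc, "mtcars")   # retrieve the same table by name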

Process data

Process data with dplyr functions, e.g.
- filter to filter rows
- select to select columns
- mutate to create new variables
- group_by to specify the groups
- summarize to summarize the values per group
- left_join to join tables

Process data

The main dplyr functions and their SQL counterparts:

dplyr       SQL                    Description
select      SELECT                 Select columns
filter      WHERE                  Filter rows
arrange     ORDER BY               Order rows
mutate      +, *, LOG, etc.        Create new columns
group_by    GROUP BY               Group rows
summarise   AVG, SUM, MIN, etc.    Aggregate columns

Recommendation: use the pipe operator %>% (a Spark example follows after this slide)

Without the pipe:
  tmp <- functionA(x, paramA1, paramA2)
  y   <- functionB(tmp, paramB)

With the pipe:
  y <- x %>%
    functionA(paramA1, paramA2) %>%
    functionB(paramB)
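A hedged sketch of such a pipeline on a Spark table, assuming the mtcars_tbl object from the earlier import sketch; sparklyr translates the verbs to Spark SQL, so the computation stays in Spark:

library(dplyr)

# Average fuel consumption per number of cylinders, for the heavier cars
result_tbl <- mtcars_tbl %>%
  filter(wt > 2) %>%                   # WHERE
  select(cyl, mpg) %>%                 # SELECT
  group_by(cyl) %>%                    # GROUP BY
  summarise(mean_mpg = mean(mpg)) %>%  # AVG
  arrange(cyl)                         # ORDER BY

result_tbl                             # still a tbl_spark; nothing copied into R yet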

Visualize data

Approach (a sketch follows below):
- Process the data such that the output is exactly what is needed for the plot
- Use collect() to copy this data into R's memory
- Use general-purpose R packages to create the plot:
  - ggplot2 for any type of plot (line chart, bar chart, scatter plot, heatmap, etc.)
  - tmap for (interactive) maps
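For instance, continuing the hedged mtcars_tbl example: the aggregation happens in Spark, and only the small summary table is collected into R before plotting.

library(dplyr)
library(ggplot2)

plot_data <- mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  collect()                        # copy the (small) result into R's memory

ggplot(plot_data, aes(x = factor(cyl), y = mean_mpg)) +
  geom_col() +
  labs(x = "Number of cylinders", y = "Average miles per gallon")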

Model data

Any R package
- Many implemented methods (over 14,000 CRAN packages)
- Downside: the data needs to be collected into R's memory (see the previous slide)

H2O
- Open-source software for big data analysis (statistical methods / machine learning)
- Integration with Spark via Sparkling Water
- Sparkling Water is accessible with the R package rsparkling

Spark MLlib
- Native machine learning algorithms
- All functions have the prefix ml_, for instance ml_random_forest
- The functions tidy, augment, and glance from the broom package are implemented for these methods. They summarize the model output at different levels; the format is always a data.frame.
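A hedged sketch of fitting a Spark MLlib model through sparklyr, again assuming the mtcars_tbl object from before; the formula and the split ratio are illustrative choices, not part of the original slides:

library(sparklyr)
library(dplyr)
library(broom)

# Split the Spark table into training and test sets (computed in Spark)
partitions <- mtcars_tbl %>%
  sdf_random_split(training = 0.7, test = 0.3, seed = 42)

# Fit a random forest regression model with Spark MLlib
fit <- ml_random_forest(partitions$training, mpg ~ wt + hp + cyl)

fit                                    # prints the fitted model
tidy(fit)                              # tidy summary as a data.frame (broom)

# Predict on the held-out test set; the result is again a tbl_spark
pred <- ml_predict(fit, partitions$test)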

Communicate results

R Markdown
- Markdown is a lightweight and easy-to-use markup language
- With R Markdown, chunks of R code can be embedded. Each chunk can be shown or hidden in the document, and the output of a chunk can likewise be shown or hidden.
- Interactive visualizations can be included as well

Shiny
- Interactive dashboards
- Can be run locally, or hosted on a server
- Free servers are available, e.g. shinyapps.io
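As a small illustrative sketch (not from the original slides), an R Markdown document with one fully hidden chunk and one chunk whose code is hidden but whose output is shown; the file contents and object names are hypothetical:

---
title: "Spark results"
output: html_document
---

```{r setup, include=FALSE}
# include=FALSE: neither the code nor its output appears in the document
library(ggplot2)
plot_data <- readRDS("plot_data.rds")
```

```{r plot, echo=FALSE}
# echo=FALSE: the code is hidden, but the output (the plot) is shown
ggplot(plot_data, aes(factor(cyl), mean_mpg)) + geom_col()
```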

Overview of sparklyr functions

Function naming:
- Feature transformation functions: prefix ft_
- Machine learning functions (Spark MLlib): prefix ml_
- Spark DataFrame functions*: prefix sdf_
- Configuration / read / write functions: prefix spark_
- Streaming data functions: prefix stream_

* Note that many of these functions can also be used without the prefix. For instance, calling copy_to on a Spark connection uses sdf_copy_to.
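For instance, a small hedged example of a feature transformation (ft_) function on the mtcars_tbl object from earlier; the column names and threshold are chosen only for illustration:

library(sparklyr)
library(dplyr)

# ft_binarizer: turns a numeric column into a 0/1 indicator, computed in Spark
mtcars_tbl %>%
  ft_binarizer(input_col = "hp", output_col = "high_hp", threshold = 150) %>%
  select(hp, high_hp) %>%
  head(5)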

References
- R for Data Science - Garrett Grolemund and Hadley Wickham, https://r4ds.had.co.nz/
- sparklyr: R interface for Apache Spark - RStudio, https://spark.rstudio.com/