R Programming for SQL Developers: ETL Using R. Kiran Math, Consultant, kiranmath@outlook.com
Motivation: Excel Data → ETL → SQL Server Table
Demo: Motivation
Installation
Comprehensive R Archive Network (CRAN): https://www.cran.r-project.org/
RStudio: https://www.rstudio.com/
R <- Core && R <- Packages
Base packages ship with core R. Add-on packages: ggplot2, sqldf, RODBC, dplyr, stringr, reshape2, tidyr, lubridate
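A minimal sketch of checking which of the add-on packages listed on this slide are already available. The names `pkgs`, `installed`, and `missing_pkgs` are illustrative, and the install call is left commented out because it needs network access to CRAN.

```r
# Add-on packages named on the slide (CRAN spellings).
pkgs <- c("ggplot2", "sqldf", "RODBC", "dplyr", "stringr",
          "reshape2", "tidyr", "lubridate")

# requireNamespace() returns FALSE instead of stopping when a package is absent.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

# Uncomment to install whatever is missing:
# install.packages(missing_pkgs)
print(missing_pkgs)
```

Once installed, a package is attached for a session with `library()`, for example `library(dplyr)`.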
zillow
Data analysis workflow (@hadleywickham): Get & Tidy, Transform, Visualize, Model
Basics 1 - Vector
# Define a variable
a <- 25
# Call a variable
a
## [1] 25
# Do something to it
a + 10
## [1] 35
# Create a vector - numeric
x <- c(0.5, 0.6, 0.7)
# Call it
x
## [1] 0.5 0.6 0.7
# Do something to the vector
mean(x)
## [1] 0.6
Basics 2 - Functions
Functions are blocks of code that make R modular and facilitate code reuse.
Funct_name <- function(arg1, arg2, ...) {
  ### do something
}
## Compute the mean of a vector of numbers
meanX <- function(a_vector) {
  s <- sum(a_vector)      # total of all elements
  l <- length(a_vector)   # number of elements
  m <- s / l
  return(m)
}
### Create a vector
v <- c(1, 2, 3, 4, 5)
### Find the mean
meanX(v)
## [1] 3
Data frame
A data frame is used for storing data tables. It is a list of vectors of equal length: each column is a variable and each row is an observation.
To preview the data frame dat: head(dat), tail(dat)
Number of rows: nrow(dat)
To retrieve data in a cell, enter its row and column coordinates in the single square bracket "[]" operator, separated by a comma, for example dat[1, 2].
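As a rough illustration of the points above, the sketch below builds a small data frame and indexes into it. The column names and values are made up for the example; only `dat` and `SaleDate` appear in the deck.

```r
# Hypothetical data: three home sales (values invented for illustration).
dat <- data.frame(
  Zip       = c("30301", "30302", "30303"),
  SalePrice = c(250000, 310000, 199000),
  SaleDate  = c("2016-01-15", "2016-02-20", "2016-03-05"),
  stringsAsFactors = FALSE
)

head(dat, 2)          # first two observations
tail(dat, 1)          # last observation
nrow(dat)             # number of rows (observations)
ncol(dat)             # number of columns (variables)

dat[2, 2]             # cell at row 2, column 2
dat[2, "SalePrice"]   # same cell, column addressed by name
dat[, "Zip"]          # empty row index returns the whole column
```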
R – str()
Compactly displays the internal structure of an R object; a diagnostic function.
str(object, ...)
If you need a quick overview of your dataset, use str(): it tells you the class of each variable and the number of observations.
Change the class of column SaleDate:
dat$SaleDate <- as.Date(dat$SaleDate)
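A small sketch of the class change described above, assuming SaleDate arrives as text in ISO format (YYYY-MM-DD). The data values are invented for the example.

```r
# Hypothetical data: SaleDate is read in as character.
dat <- data.frame(
  SalePrice = c(250000, 310000),
  SaleDate  = c("2016-01-15", "2016-02-20"),
  stringsAsFactors = FALSE
)

str(dat)                    # SaleDate is listed as chr
class(dat$SaleDate)

# Convert the column to the Date class.
dat$SaleDate <- as.Date(dat$SaleDate)

str(dat)                    # SaleDate is now listed as Date
days_between <- as.numeric(dat$SaleDate[2] - dat$SaleDate[1])
days_between                # date arithmetic now works
```

For text in another layout, `as.Date()` accepts a `format` argument, for example `format = "%m/%d/%Y"`.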
R – summary()
summary(object) shows the distribution of the variables in the dataset.
Numerical variables: summary() gives you the range, quartiles, median, and mean.
Factor variables: summary() gives you a table with frequencies.
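The two behaviours of summary() can be seen side by side in a sketch like the following; the columns `SalePrice` and `Type` are hypothetical.

```r
# Hypothetical data: one numeric column and one factor column.
dat <- data.frame(
  SalePrice = c(100, 200, 300, 400, 500),
  Type      = factor(c("Condo", "House", "House", "Condo", "House"))
)

num_summary <- summary(dat$SalePrice)  # min, quartiles, median, mean, max
fac_summary <- summary(dat$Type)       # count per factor level

print(num_summary)
print(fac_summary)
summary(dat)                           # both at once, column by column
```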
Reshaping Data - dplyr: select() subsets variables (columns). (Example data frames on the slide: tDat, dat.)
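A minimal sketch of select(), assuming the dplyr package is installed. The column names are invented, and the assumption that tDat is derived from dat is mine, not stated on the slide.

```r
library(dplyr)

# Hypothetical data frame.
dat <- data.frame(
  Zip       = c("30301", "30302", "30303"),
  SalePrice = c(250000, 310000, 199000),
  Beds      = c(2, 4, 3),
  stringsAsFactors = FALSE
)

tDat   <- select(dat, Zip, SalePrice)  # keep only the named columns
noBeds <- select(dat, -Beds)           # a minus sign drops a column instead

names(tDat)
names(noBeds)
```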
Filter Data - dplyr: filter() allows you to select a subset of rows in a data frame.
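For SQL developers, filter() plays roughly the role of a WHERE clause. The sketch below assumes dplyr is installed and uses invented data.

```r
library(dplyr)

# Hypothetical data frame.
dat <- data.frame(
  Zip       = c("30301", "30302", "30303"),
  SalePrice = c(250000, 310000, 199000),
  Beds      = c(2, 4, 3),
  stringsAsFactors = FALSE
)

# One condition: rows with SalePrice above 200000.
expensive <- filter(dat, SalePrice > 200000)

# Several conditions separated by commas are combined with AND.
large_expensive <- filter(dat, SalePrice > 200000, Beds >= 3)

expensive
large_expensive
```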
Piping - dplyr: %>% passes the object on the left-hand side as the first argument to the function on the right-hand side.
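To make the pipe concrete, the sketch below writes the same two-step query as a nested call and as a pipeline. It assumes dplyr is installed; the data is invented.

```r
library(dplyr)

# Hypothetical data frame.
dat <- data.frame(
  Zip       = c("30301", "30302", "30303"),
  SalePrice = c(250000, 310000, 199000),
  Beds      = c(2, 4, 3),
  stringsAsFactors = FALSE
)

# Nested form: read from the inside out.
nested <- select(filter(dat, SalePrice > 200000), Zip, SalePrice)

# Piped form: read top to bottom; each result feeds the next call.
piped <- dat %>%
  filter(SalePrice > 200000) %>%
  select(Zip, SalePrice)

all.equal(nested, piped)
```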
Reshaping Data - tidyr: gather() gathers columns into rows; spread() does the opposite. (Example data frames on the slide: tDat, gDat.)
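A sketch of the wide-to-long round trip, assuming tidyr is installed. The column names (Y2015, Y2016, Year, MedianPrice) and values are hypothetical. Newer tidyr versions offer pivot_longer() and pivot_wider() for the same job, but gather() and spread() remain available.

```r
library(tidyr)

# Hypothetical wide table: one column per year.
tDat <- data.frame(
  Zip   = c("30301", "30302"),
  Y2015 = c(200000, 300000),
  Y2016 = c(215000, 330000),
  stringsAsFactors = FALSE
)

# Gather the year columns into key/value rows.
gDat <- gather(tDat, key = "Year", value = "MedianPrice", Y2015, Y2016)
gDat

# Spread the key/value pairs back into one column per year.
wide <- spread(gDat, key = Year, value = MedianPrice)
wide
```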
Make New Variables (Columns) - dplyr: mutate() computes and appends one or more new columns. (Example data frame on the slide: gDat.)
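A short sketch of mutate(), assuming dplyr is installed. The columns SqFt, PricePerSqFt, and PriceK are invented for the example.

```r
library(dplyr)

# Hypothetical data frame.
gDat <- data.frame(
  Zip       = c("30301", "30302"),
  SalePrice = c(250000, 310000),
  SqFt      = c(1250, 2000),
  stringsAsFactors = FALSE
)

# Append two computed columns; existing columns are kept.
gDat <- mutate(gDat,
               PricePerSqFt = SalePrice / SqFt,
               PriceK       = SalePrice / 1000)
gDat
```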
Reshaping Data - tidyr: separate() separates one column into several; unite() does the opposite. (Example data frames on the slide: gDat, tDat.)
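A sketch of splitting and rejoining a column, assuming tidyr is installed. The Period column and its "year-month" layout are assumptions made for the example.

```r
library(tidyr)

# Hypothetical data: year and month packed into one text column.
gDat <- data.frame(
  Zip    = c("30301", "30302"),
  Period = c("2016-01", "2016-02"),
  stringsAsFactors = FALSE
)

# Split Period at the hyphen into two character columns.
tDat <- separate(gDat, Period, into = c("Year", "Month"), sep = "-")
tDat

# unite() pastes the pieces back together.
back <- unite(tDat, "Period", Year, Month, sep = "-")
back
```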
Data Visualization – ggplot2: Based on the Grammar of Graphics. One can build every graph from the same few components: a data set, a set of geoms (visual marks that represent the data), and a coordinate system.
Data Visualization – ggplot2: To display data values, map the variables in the dataset to aesthetic properties of the geom, such as color, size, and x and y locations.
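The mapping from variables to aesthetics might look like the sketch below, assuming ggplot2 is installed. The data frame and its columns are invented.

```r
library(ggplot2)

# Hypothetical data frame.
dat <- data.frame(
  SqFt      = c(900, 1250, 1600, 2000),
  SalePrice = c(150000, 250000, 280000, 310000),
  Beds      = c(1, 2, 3, 4),
  Type      = c("Condo", "Condo", "House", "House")
)

# aes() maps variables to visual properties of the point geom:
# position from SqFt and SalePrice, color from Type, size from Beds.
p <- ggplot(dat, aes(x = SqFt, y = SalePrice, color = Type, size = Beds)) +
  geom_point()

print(p)
```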
Data Visualization – ggplot2: qplot() creates a complete plot with given data, geom, and mapping. It supplies many useful defaults.
Data Visualization – ggplot2: ggplot() begins a plot that you finish by adding layers to it. Add layer elements with +. It has no defaults but provides more control than qplot().
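Layering with + can be sketched as follows, assuming ggplot2 is installed. The data and labels are invented.

```r
library(ggplot2)

# Hypothetical data frame.
dat <- data.frame(
  SqFt      = c(900, 1250, 1600, 2000),
  SalePrice = c(150000, 250000, 280000, 310000)
)

# Start with data and mappings only; nothing is drawn yet.
base <- ggplot(dat, aes(x = SqFt, y = SalePrice))

# Each + adds a layer or a plot element.
p <- base +
  geom_point() +
  geom_line() +
  labs(title = "Sale price by size", x = "Square feet", y = "Sale price")

print(p)
```

Because `base` is an ordinary object, the same starting point can be reused with different layers.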
Data Visualization – ggplot2: lm() fits a linear model. A fitted line can be added to a plot as one more layer, for example with geom_smooth(method = "lm").
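A sketch of combining lm() with a plot, assuming ggplot2 is installed. The data is synthetic and exactly linear (y = 1 + 2x) so the fitted coefficients are easy to check; using geom_smooth() for the fitted line is one common approach, not necessarily the one shown in the talk.

```r
library(ggplot2)

# Synthetic, exactly linear data: y = 1 + 2 * x.
d <- data.frame(x = 1:10)
d$y <- 1 + 2 * d$x

# Fit the linear model directly with base R.
fit <- lm(y ~ x, data = d)
coef(fit)   # intercept and slope

# Overlay the same kind of fit on a scatter plot as an extra layer.
p <- ggplot(d, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

print(p)
```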
Thank you