R PROGRAMMING FOR SQL DEVELOPERS
Kiran Math
Developer, Proterra, Greenville, SC
MOTIVATION
GOAL
Raw Sensor Data → Tidy Data
ZILLOW
INSTALLATION
Comprehensive R Archive Network (CRAN)
RStudio
R <- CORE && R <- PACKAGES
Base packages
Packages: ggplot2, sqldf, RODBC, dplyr, stringr, reshape2, tidyr, lubridate
BASICS 1 - VECTOR
# Define a variable
a <- 25
# Call a variable
a
## [1] 25
# Do something to it
a + 10
## [1] 35
# Create a vector - numeric
x <- c(0.5, 0.6, 0.7)
# Call it
x
## [1] 0.5 0.6 0.7
# Do something to the vector
mean(x)
## [1] 0.6
BASICS 2 - FUNCTIONS
Functions are blocks of code that make R modular and facilitate code reuse.
Funct_name <- function(arg1, arg2, ...) {
  ### do something
}
## Compute the mean of a vector of numbers
meanX <- function(a_vector) {
  s <- sum(a_vector)      # total of all elements
  l <- length(a_vector)   # number of elements
  m <- s / l
  return(m)
}
### Create a vector
v <- c(1, 2, 3, 4, 5)
### Find the mean
meanX(v)
## [1] 3
HOME SALE
Question: I have a 3,000 sq ft house. How much will it sell for?
Workflow: Get & Tidy, Transform, Visualize, Model (diagram: Hadley Wickham)
GET DATA – FROM SQL SERVER
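A minimal sketch of how a table might be pulled from SQL Server with the RODBC package listed earlier. The helper name, connection string, and table name are placeholders, not values from the talk, and the function is only defined here, not run against a real server.

```r
# Hypothetical helper: server, database and table are placeholders.
read_home_sales <- function(server, database, table = "dbo.HomeSales") {
  conn_string <- paste0(
    "driver={SQL Server};server=", server,
    ";database=", database, ";trusted_connection=yes"
  )
  channel <- RODBC::odbcDriverConnect(conn_string)
  on.exit(RODBC::odbcClose(channel))  # release the connection on exit

  # Plain SQL goes in as a string; the result comes back as a data frame
  RODBC::sqlQuery(channel, paste("SELECT * FROM", table))
}
```

A call would look like `dat <- read_home_sales("myserver", "mydb")`, with both arguments replaced by real names.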
GET DATA – FROM CSV FILE
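A small sketch of reading a CSV file with base R. The file is written to a temporary path first so the example is self-contained; the column names and values are made up for illustration.

```r
# Write a tiny made-up CSV so the example runs on its own
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "SaleDate,Area,SalePrice",
  "2015-01-15,1500,110000",
  "2015-02-20,2200,152000",
  "2015-03-05,3100,205000"
), csv_path)

# read.csv() returns a data frame; keep text columns as character
dat <- read.csv(csv_path, stringsAsFactors = FALSE)
head(dat)
```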
DATA FRAME
dat: columns are variables, rows are observations
dat[5, 3] selects the value in row 5, column 3
To preview the data frame: head(dat), tail(dat)
Number of rows: nrow(dat)
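The indexing and preview calls can be tried on a small data frame. The values below are invented for illustration and are not the talk's data.

```r
# Made-up home sale data: 6 observations (rows), 3 variables (columns)
dat <- data.frame(
  SaleDate  = c("2015-01-15", "2015-02-20", "2015-03-05",
                "2015-04-11", "2015-05-02", "2015-06-18"),
  Area      = c(1500, 2200, 3100, 1800, 2600, 2000),
  SalePrice = c(110000, 152000, 205000, 126000, 171000, 139000),
  stringsAsFactors = FALSE
)

dat[5, 3]     # row 5, column 3: the SalePrice of the fifth sale
head(dat, 3)  # first three rows
tail(dat, 3)  # last three rows
nrow(dat)     # number of observations
ncol(dat)     # number of variables
```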
R – STR()
str(object, ...)
Compactly displays the internal structure of an R object; a diagnostic function.
dat$SaleDate <- as.Date(dat$SaleDate)
Changes the class of column SaleDate.
Data frame shown: tDat
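A short sketch of the inspect-then-convert step, using a made-up data frame in which SaleDate starts out as text.

```r
# Made-up data: SaleDate is read in as character
dat <- data.frame(
  SaleDate  = c("2015-01-15", "2015-02-20", "2015-03-05"),
  Area      = c(1500, 2200, 3100),
  stringsAsFactors = FALSE
)

str(dat)                    # SaleDate is listed as chr
class_before <- class(dat$SaleDate)

# Convert the text column to a proper Date column
dat$SaleDate <- as.Date(dat$SaleDate)

str(dat)                    # SaleDate is now listed as Date
class_after <- class(dat$SaleDate)
```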
R – SUMMARY()
summary(object)
Shows the distribution of each variable in the dataset.
Data frame shown: tDat
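As a quick illustration on invented numbers, summary() on a numeric column reports the minimum, quartiles, mean, and maximum.

```r
# Made-up areas in square feet
dat <- data.frame(Area = c(1500, 2200, 3100, 1800, 2600, 2000))

summary(dat)            # one block of statistics per column
s <- summary(dat$Area)  # the same statistics for a single column
names(s)
```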
RESHAPING DATA - DPLYR
select(): subset variables (columns).
Data frames shown: tDat, dat
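For a SQL developer, select() plays roughly the role of the column list in a SELECT statement. A minimal sketch on made-up data:

```r
library(dplyr)

# Made-up data for illustration
dat <- data.frame(
  SaleDate  = c("2015-01-15", "2015-02-20", "2015-03-05"),
  Area      = c(1500, 2200, 3100),
  SalePrice = c(110000, 152000, 205000),
  stringsAsFactors = FALSE
)

# Roughly: SELECT Area, SalePrice FROM dat
tDat <- select(dat, Area, SalePrice)
tDat
```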
FILTER DATA - DPLYR
filter() selects a subset of rows in a data frame.
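filter() corresponds roughly to a SQL WHERE clause. A small sketch with invented values:

```r
library(dplyr)

# Made-up data for illustration
dat <- data.frame(
  Area      = c(1500, 1800, 2200, 2600, 3100),
  SalePrice = c(110000, 126000, 152000, 171000, 205000)
)

# Roughly: SELECT * FROM dat WHERE Area > 2000
big_homes <- filter(dat, Area > 2000)
big_homes
```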
PIPING - DPLYR
%>% passes the object on the LHS as the first argument to the function on the RHS.
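To make the rule concrete, the sketch below writes the same query twice, once nested and once piped, on made-up data. Both forms should give the same result.

```r
library(dplyr)

# Made-up data for illustration
dat <- data.frame(
  Area      = c(1500, 1800, 2200, 2600, 3100),
  SalePrice = c(110000, 126000, 152000, 171000, 205000)
)

# Nested calls read inside-out
nested <- select(filter(dat, Area > 2000), SalePrice)

# Piped calls read top to bottom: dat becomes the first argument of filter(),
# and the filtered result becomes the first argument of select()
piped <- dat %>%
  filter(Area > 2000) %>%
  select(SalePrice)
```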
RESHAPING DATA - TIDYR
gather(): gather columns into rows.
spread() does the opposite.
Data frames shown: gDat, tDat
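A minimal sketch of the wide-to-long round trip. The HouseId, Measure, and Value column names are assumptions made for this example. Newer tidyr versions also offer pivot_longer() and pivot_wider() for the same job.

```r
library(tidyr)

# Made-up wide data: one row per house
tDat <- data.frame(
  HouseId   = 1:3,
  Area      = c(1500, 2200, 3100),
  SalePrice = c(110000, 152000, 205000)
)

# Gather the two measurement columns into key/value rows
gDat <- gather(tDat, key = "Measure", value = "Value", Area, SalePrice)

# Spread the key/value rows back into separate columns
wide_again <- spread(gDat, key = Measure, value = Value)
```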
MAKE NEW VARIABLE (COLUMN)
mutate(): computes and appends one or more new columns.
Data frame shown: gDat
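A brief sketch of mutate() on invented data. The PricePerSqFt column is a hypothetical example of a derived variable, not one taken from the talk.

```r
library(dplyr)

# Made-up data for illustration
dat <- data.frame(
  Area      = c(1500, 2200, 3100),
  SalePrice = c(110000, 152000, 205000)
)

# Append a computed column; the existing columns are kept
dat2 <- mutate(dat, PricePerSqFt = SalePrice / Area)
dat2
```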
RESHAPING DATA - TIDYR
separate(): separate one column into several.
unite() does the opposite.
Data frames shown: gDat, tDat
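One way to picture separate() is splitting a date string into its parts. The sketch below assumes a text SaleDate column in year-month-day form. The Year, Month, and Day names are chosen for this example.

```r
library(tidyr)

# Made-up data: SaleDate stored as text
dat <- data.frame(
  SaleDate  = c("2015-01-15", "2015-02-20", "2015-03-05"),
  SalePrice = c(110000, 152000, 205000),
  stringsAsFactors = FALSE
)

# Split one column into three on the "-" delimiter
parts <- separate(dat, SaleDate, into = c("Year", "Month", "Day"), sep = "-")

# unite() reverses the split
back <- unite(parts, "SaleDate", Year, Month, Day, sep = "-")
```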
Workflow: Get & Tidy, Transform, Visualize, Model (diagram: Hadley Wickham)
DATA VISUALIZATION – GGPLOT2
ggplot2 is based on the Grammar of Graphics.
Every graph can be built from the same few components:
Data set
Set of geoms (visual marks that represent the data)
Coordinate system
DATA VISUALIZATION – GGPLOT2
To display data values, map variables in the dataset to aesthetic properties of the geom, such as color, size, and x and y locations.
DATA VISUALIZATION – GGPLOT2
qplot()
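A quick-plot sketch on made-up data. Recent ggplot2 releases flag qplot() as deprecated, so this call may print a warning there.

```r
library(ggplot2)

# Made-up data for illustration
dat <- data.frame(
  Area      = c(1500, 1800, 2200, 2600, 3100),
  SalePrice = c(110000, 126000, 152000, 171000, 205000)
)

# Scatter plot of sale price against area in a single call
p <- qplot(Area, SalePrice, data = dat)
```

Typing `p` at the console draws the plot.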
DATA VISUALIZATION – GGPLOT2
ggplot()
Add layer elements with +
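The layering idea can be sketched as follows, on invented data: ggplot() sets up the data and aesthetic mapping, and each + adds another element.

```r
library(ggplot2)

# Made-up data for illustration
dat <- data.frame(
  Area      = c(1500, 1800, 2200, 2600, 3100),
  SalePrice = c(110000, 126000, 152000, 171000, 205000)
)

p <- ggplot(dat, aes(x = Area, y = SalePrice)) +  # data + aesthetic mapping
  geom_point(color = "steelblue", size = 3) +     # geom layer: points
  labs(x = "Area (sq ft)", y = "Sale price ($)")  # axis labels
```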
LINEAR REGRESSION MODEL
LEAST SQUARES METHOD
R function: lm()
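lm() fits a line by choosing the intercept and slope that minimize the sum of squared residuals. A minimal sketch on made-up data, so the coefficients will not match the talk's:

```r
# Made-up data for illustration
dat <- data.frame(
  Area      = c(1500, 1800, 2200, 2600, 3100),
  SalePrice = c(110000, 126000, 152000, 171000, 205000)
)

# Model SalePrice as a linear function of Area
fit <- lm(SalePrice ~ Area, data = dat)

coef(fit)     # intercept and slope (price change per extra square foot)
summary(fit)  # fit statistics, including R-squared
```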
MODEL - CORRELATION
cor()
Is Area correlated with Sale Price?
The output value is between -1 and 1; values near 0 indicate little linear relationship.
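A short check on invented numbers, where price rises almost linearly with area, so the coefficient should come out close to 1.

```r
# Made-up data for illustration
dat <- data.frame(
  Area      = c(1500, 1800, 2200, 2600, 3100),
  SalePrice = c(110000, 126000, 152000, 171000, 205000)
)

# Pearson correlation between the two columns
r <- cor(dat$Area, dat$SalePrice)
r
```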
MODEL - PREDICTION
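A sketch of the prediction step using predict() on a fitted lm model. The data are made up, so the result will not reproduce the talk's figure exactly.

```r
# Made-up data for illustration
dat <- data.frame(
  Area      = c(1500, 1800, 2200, 2600, 3100),
  SalePrice = c(110000, 126000, 152000, 171000, 205000)
)

fit <- lm(SalePrice ~ Area, data = dat)

# New observations must use the same column name as the model formula
new_home <- data.frame(Area = 3000)
estimate <- predict(fit, newdata = new_home)
estimate
```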
DATA VISUALIZATION – GGPLOT2
lm()
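One common way to draw a fitted line over the points is geom_smooth(method = "lm"). Whether the talk's slide was built this way is an assumption, and the data below are made up.

```r
library(ggplot2)

# Made-up data for illustration
dat <- data.frame(
  Area      = c(1500, 1800, 2200, 2600, 3100),
  SalePrice = c(110000, 126000, 152000, 171000, 205000)
)

p <- ggplot(dat, aes(x = Area, y = SalePrice)) +
  geom_point() +
  # Overlay the least squares line; se = FALSE hides the confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
```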
HOME SALE
Question: I have a 3,000 sq ft house. How much will it sell for?
Answer: $198,000
THANK YOU