R PROGRAMMING FOR SQL DEVELOPERS Kiran Math Developer : Proterra in Greenville SC
MOTIVATION
GOAL Raw Sensor Data -> Tidy Data
ZILLOW
Get & Tidy -> Transform -> Model -> Viz (workflow diagram, hadleywickham)
VASCO DA GAMA BRIDGE - LISBON IN PORTUGAL Question : What is the probability of having seventeen or more vehicles crossing the bridge in a particular minute?
Raw Data: data on the web, CSV format
Processing Script (R code): read the CSV from the web into R
Tidy Data: packages used: tidyr
Data Manipulation and Analysis (R code): average vehicles per minute = 12
Data Model: Poisson distribution
ppois(16, lambda=12, lower=FALSE) # upper tail
Answer: the probability of having seventeen or more vehicles crossing the bridge in a particular minute is 10.1%
Data Visualization (R code): ggplot2, base plot
Data Communication: blog
Code Repository: GitHub
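A minimal sketch of the Poisson step, assuming (as the slide states) an average of 12 vehicles per minute. "Seventeen or more" is the upper tail above 16, so the call asks for P(X > 16). The variable names are illustrative, and the second computation is only a cross-check using the probability mass function.

```r
# Average number of vehicles per minute (from the slide)
lambda <- 12

# P(X >= 17) = P(X > 16): upper tail of the Poisson distribution.
# lower.tail = FALSE is the full name of the argument the slide abbreviates as lower.
p_upper <- ppois(16, lambda = lambda, lower.tail = FALSE)

# Cross-check: 1 minus the sum of P(X = 0), ..., P(X = 16)
p_check <- 1 - sum(dpois(0:16, lambda))

# Print as a percentage rounded to one decimal place
print(round(100 * p_upper, 1))
```

Both routes give the same probability; `ppois` is simply the cumulative form of `dpois`.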
INSTALLATION
Comprehensive R Archive Network (CRAN)
RStudio
ROBERT GENTLEMAN - ROSS IHAKA University of Auckland
R <- CORE && R <- PACKAGES
Base Packages
sqldf RODBC dplyr stringr ggplot2 reshape2 tidyr lubridate
FEATURES OF R
Runs on almost any standard computing platform/OS (even on the PlayStation 3).
Frequent releases (annual + bug fix releases); active development.
Quite lean, as far as software goes; functionality is divided into modular packages.
Graphics capabilities are very sophisticated and better than most stat packages.
Very active and vibrant user community: R-help and R-devel mailing lists and Stack Overflow.
DRAWBACKS OF R
Essentially based on 40-year-old technology.
Objects must generally be stored in physical memory.
BASICS 1 - VECTOR
# Define a variable
a <- 25
# Call a variable
a
## [1] 25
# Do something to it
a + 10
## [1] 35
# Create a vector - numeric
x <- c(0.5, 0.6, 0.7)
# Call it
x
## [1] 0.5 0.6 0.7
# Do something to the vector
mean(x)
## [1] 0.6
BASICS 2 - MATRIX
A matrix is a collection of data elements arranged in a two-dimensional rectangular layout.
> A = matrix(
+   c(1, 2, 3, 4, 5, 6), # the data elements
+   nrow=2,              # number of rows
+   ncol=3,              # number of columns
+   byrow = TRUE)        # fill matrix by rows
> A                      # print the matrix
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
BASICS 3 – CONTROL STRUCTURES
# If statements
x <- 10
y <- if (x > 75) 'Pass' else 'Fail'
# Get the value of variable y
y
## [1] "Fail"
# For loops
for (index in 1:3) {
  print(index)
}
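The if statement on this slide tests a single value. For a whole vector, base R's `ifelse()` applies the same test element by element, much like a CASE expression over a column in SQL. The sketch below reuses the slide's threshold of 75; the `scores` values are made up for illustration.

```r
# Hypothetical scores, invented for this example
scores <- c(82, 45, 76, 91)

# Vectorized if/else: one result per element, no explicit loop needed
results <- ifelse(scores > 75, "Pass", "Fail")
print(results)

# The same logic with an explicit for loop, for comparison
results_loop <- character(length(scores))
for (i in seq_along(scores)) {
  results_loop[i] <- if (scores[i] > 75) "Pass" else "Fail"
}
```

The vectorized form is usually preferred in R: it is shorter and avoids managing an index by hand.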
BASICS 4 - FUNCTIONS
Functions are blocks of code that make R code modular and facilitate code reuse.
Funct_name <- function(arg1, arg2, ..) {
  ### do something
}
## Compute the mean of a vector of numbers
meanX <- function(a_vector) {
  s <- sum(a_vector)
  l <- length(a_vector)
  m <- s/l
  return(m)
}
### Create a vector
v <- c(1,2,3,4,5)
### Find the mean
meanX(v)
## [1] 3
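Function arguments can also carry default values. The sketch below is a hypothetical variant of the slide's meanX (the name `mean_x` and the `na_rm` argument are assumptions made for this example) that optionally drops missing values, loosely analogous to how SQL's AVG() skips NULLs.

```r
# Hypothetical variant of meanX with an optional argument.
# na_rm defaults to FALSE, so callers who omit it get the plain mean.
mean_x <- function(a_vector, na_rm = FALSE) {
  if (na_rm) {
    # Keep only the non-missing elements
    a_vector <- a_vector[!is.na(a_vector)]
  }
  # The value of the last expression is returned; return() is optional
  sum(a_vector) / length(a_vector)
}

mean_x(c(1, 2, 3, 4, 5))              # no missing values
mean_x(c(1, NA, 3))                   # NA propagates through sum()
mean_x(c(1, NA, 3), na_rm = TRUE)     # NA dropped before averaging
```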
BASICS 5 - DATA FRAME
A data frame is used for storing data tables. To retrieve the data in a cell, enter its row and column coordinates in the single square bracket "[]" operator.
mtcars[1, 2]
[1] 6
mtcars["Mazda RX4", "cyl"]
[1] 6
Preview the data frame
head(mtcars)
tail(mtcars)
View(mtcars)
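The same bracket operator also takes a logical condition for rows and a vector of names for columns, which maps closely onto SELECT ... WHERE in SQL. A minimal sketch on the built-in mtcars data set; the SQL in the comments is only an analogy and the variable names are illustrative.

```r
# SQL analogy: SELECT mpg, cyl FROM mtcars WHERE cyl = 6 AND mpg > 20
six_cyl <- mtcars[mtcars$cyl == 6 & mtcars$mpg > 20, c("mpg", "cyl")]
print(six_cyl)

# SQL analogy: SELECT COUNT(*) FROM mtcars
n_rows <- nrow(mtcars)

# A single column as a plain vector, via the $ operator
mpg_values <- mtcars$mpg
```

Leaving the row position empty (`mtcars[, c("mpg", "cyl")]`) keeps all rows, and leaving the column position empty keeps all columns.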
BASICS 6 - PLOTS
# Make a very simple plot
# Define vectors
x <- c(1,3,6,9,12)
y <- c(1.5,2,7,8,15)
plot(x, y, xlab="x axis", ylab="y axis", main="my plot",
     ylim=c(0,20), xlim=c(0,20), pch=15, col="blue")
# Add some more points to the graph
x2 <- c(0.5, 3, 5, 8, 12)
y2 <- c(0.8, 1, 2, 4, 6)
points(x2, y2, pch=16, col="green")
HOME SALE
I have home sales data for the neighborhood in a SQL Server database.
Question: I have a 3000 sq ft house; how much will it sell for?
REGRESSION MODEL
Demo : Predict sale price of the house that is 3000 sq ft
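A minimal sketch of what such a demo could look like, using base R's `lm()` for a simple linear regression of price on square footage. The data frame here is entirely made up and stands in for the result of a query against the sales database. The column names `sqft` and `price` are assumptions, and this is not the presenter's actual demo code.

```r
# Hypothetical neighborhood sales, invented for this example;
# in practice this data frame would come from the database
sales <- data.frame(
  sqft  = c(1500, 1800, 2100, 2400, 2700, 3300, 3600),
  price = c(155000, 182000, 209000, 241000, 268000, 331000, 357000)
)

# Fit a straight line: price = intercept + slope * sqft
fit <- lm(price ~ sqft, data = sales)

# Predict the sale price of a 3000 sq ft house
predicted <- predict(fit, newdata = data.frame(sqft = 3000))
print(predicted)

# Inspect the fitted intercept and slope
print(coef(fit))
```

`summary(fit)` gives standard errors and R-squared, which are worth checking before trusting a prediction.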
MANAGING DATA FRAMES WITH DPLYR
The dplyr package provides simple functions that can be chained together to manipulate data easily and quickly.
install.packages("dplyr")
library(dplyr)
Verbs
1. filter – select a subset of the rows of a data frame
2. arrange – reorder the rows of a data frame
3. select – select columns of a data frame
4. mutate – add new columns to a data frame that are functions of existing columns
5. summarize – collapse values into summary statistics
6. group_by – describe how to break a data frame into groups of rows
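The six verbs can be chained with the pipe operator `%>%`, and the result reads much like a SQL query. A minimal sketch on the built-in mtcars data set, assuming dplyr is already installed. The derived column `wt_lb` and the result name `by_cyl` are made up for this example, and the SQL keywords in the comments are only rough analogies.

```r
library(dplyr)

by_cyl <- mtcars %>%
  select(mpg, cyl, hp, wt) %>%      # SELECT: keep four columns
  filter(hp > 100) %>%              # WHERE: keep rows with hp above 100
  mutate(wt_lb = wt * 1000) %>%     # computed column (wt is in 1000 lbs)
  group_by(cyl) %>%                 # GROUP BY cyl
  summarize(n = n(),                # COUNT(*)
            avg_mpg = mean(mpg),    # AVG(mpg)
            avg_wt_lb = mean(wt_lb)) %>%
  arrange(desc(avg_mpg))            # ORDER BY avg_mpg DESC

print(by_cyl)
```

Each verb takes a data frame and returns a data frame, which is what makes the chain possible.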
DEMO : DPLYR
VISUALIZING DATA FRAMES WITH GGPLOT2
Grammar of Graphics
The ggplot2 package provides two workhorse functions for plotting
1. qplot()
2. ggplot()
install.packages("ggplot2")
library(ggplot2)
Building Blocks
1. Data frame
2. Aesthetics – how data is mapped to color and size ~ aes()
3. Geoms – geometric objects to be drawn, such as points, lines, bars, polygons and text
4. Facets – panels used in conditional plots
5. Stats – statistical transformations ~ binning, quantiles, smoothing
6. Scales – the coding an aesthetic mapping uses, e.g. male = blue and female = red
7. Coordinate system
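A minimal sketch that touches most of the building blocks listed above, using the built-in mtcars data set and assuming ggplot2 is installed. The `cars` data frame and its `transmission` column are made up for this example; the colors follow the slide's blue/red scale illustration.

```r
library(ggplot2)

# Data frame: mtcars with a readable transmission label (0 = automatic, 1 = manual)
cars <- transform(mtcars,
                  transmission = factor(am, labels = c("automatic", "manual")))

p <- ggplot(cars, aes(x = wt, y = mpg)) +            # aesthetics: x and y mapping
  geom_point(aes(color = transmission)) +            # geom: points, colored by group
  geom_smooth(method = "lm", se = FALSE) +           # stat: fitted line per panel
  facet_wrap(~ cyl) +                                # facets: one panel per cylinder count
  scale_color_manual(values = c(automatic = "blue",  # scale: value-to-color coding
                                manual = "red")) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       title = "Fuel economy by weight")

print(p)
```

Layers are added with `+`, so a plot can be built up and adjusted one block at a time.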
DEMO : GGPLOT2
THANK YOU