Introduction
R is an extremely popular tool for statistical analysis and data mining.
It is free and open source, and can be installed with ease on all major platforms.
It is the result of the collective work of many researchers and experts in data mining.
Prominent vendors such as SAP, Oracle and Tableau allow R to be integrated with their applications.
Visit https://cran.r-project.org/ for software downloads, packages, documents and the latest on R.
The web is awash with valuable information, blogs and tips on R; you will find answers to most questions.
Getting Started: R Data Types (Simple)
Numeric, Integer, Complex, Logical, Character
Try the following in the R console at the command prompt (>)
"<-" is R's assignment operator; at the prompt it can be used interchangeably with "="
> X <- 10
> class(X)
Repeat the same with values of the other types, such as a complex number (written with an "i" suffix) or a character string, and check class(X) each time
> X <- "a"
Even a whole number such as 10 is by default considered "numeric"; it can be changed to "integer" as follows
> X <- as.integer(X)
In the same way, as.character(X) converts a variable to character
is.integer(X) or is.character(X) checks whether a variable is an integer or a character
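The five simple types can be checked side by side. This is a minimal sketch; the variable names and example values are illustrative assumptions, not the ones on the slide.

```r
# Example values are assumptions chosen for illustration.
x_num <- 10.5      # numeric (double)
x_int <- 10L       # the L suffix creates an integer directly
x_cpx <- 3 + 2i    # complex
x_lgl <- TRUE      # logical
x_chr <- "a"       # character

class(x_num)
class(x_int)
class(x_cpx)
class(x_lgl)
class(x_chr)

# A whole number typed without the L suffix is still numeric
x <- 10
class(x)
x <- as.integer(x)   # convert explicitly
class(x)
is.integer(x)
as.character(x)      # the same value as a character string
```

Using the L suffix avoids the separate as.integer() step when the value is known to be an integer from the start.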
Getting Started: R Data Types (Complex)
Vector
A vector is a sequence of data elements of the same basic type. Vectors can hold integers, numerics, characters and so on, but all elements of a vector need to be of the same data type; R coerces mixed elements to a common type.
Try the following in the R console at the command prompt
> v1 <- c(1,2,3)
> v2 <- c(4,5,6)
creates two vectors, v1 and v2
> add <- v1 + v2
> sub <- v2 - v1
creates two vectors, add and sub, by arithmetic operations on them
Vectors in these operations should be of the same length; otherwise R recycles the shorter vector, and throws a warning when the longer length is not a multiple of the shorter one
length(v) gives the length of the vector
v[i] retrieves the ith member of the vector
v[-i] retrieves all but the ith member of the vector
v[i:j] retrieves the ith to jth members of the vector
class(v1) gives the type of the vector's elements, "numeric", and not "vector"
Note that R is case sensitive: v1 and V1 are different names
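The vector operations above, including recycling and coercion, can be sketched as follows. The names total, difference, recycled and mixed are illustrative assumptions.

```r
v1 <- c(1, 2, 3)
v2 <- c(4, 5, 6)

total <- v1 + v2        # element-wise addition
difference <- v2 - v1   # element-wise subtraction

length(v1)   # number of elements
v1[2]        # second element
v1[-2]       # everything except the second element
v1[2:3]      # second to third elements

# Recycling: the shorter vector is repeated to match the longer one.
# Lengths 4 and 2 divide evenly, so no warning is raised here.
recycled <- c(1, 2, 3, 4) + c(10, 20)

# Coercion: mixing types forces a common type, here character.
mixed <- c(1, "a", TRUE)
class(mixed)
```

Adding vectors of length 3 and 2 would still return a result, but with a warning, because 3 is not a multiple of 2.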
Matrix
A matrix is similar to a vector but arranged in a 2D format with rows and columns.
Run the following sequence of commands to build a matrix
> mat <- c(1,2,3,4,5,6)
> mat <- matrix(mat, nrow = 2, ncol = 3)
> dim(mat) returns nrow and ncol for a matrix
> mat[n, ] returns the nth row
> mat[, n] returns the nth column
> mat[, n:m] returns the nth to mth columns
> mat[, c(n,m)] returns the nth and mth columns
> t(mat) returns the transpose of the matrix
> solve(mat) returns the inverse of mat; mat has to be a square matrix
> A %*% B is the matrix multiplication of A and B
> rbind(A, B) merges two matrices by rows
> cbind(A, B) merges two matrices by columns
Create two matrices and try these in the R console
rbind needs matrices with the same number of columns; cbind needs matrices with the same number of rows
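A short sketch of the matrix commands follows. Since the 2 x 3 matrix mat is not square, a separate 2 x 2 matrix (sq, an assumption made for illustration) is used to try solve().

```r
mat <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)  # filled column by column

dim(mat)          # rows, then columns
mat[1, ]          # first row
mat[, 2]          # second column
mat[, 2:3]        # second to third columns
t(mat)            # transpose, now 3 x 2

# solve() needs a square, invertible matrix; sq is an illustrative example.
sq <- matrix(c(2, 1, 1, 3), nrow = 2)
inv <- solve(sq)
sq %*% inv        # identity matrix, up to rounding

product <- mat %*% t(mat)   # (2 x 3) times (3 x 2) gives 2 x 2

stacked <- rbind(mat, mat)  # same number of columns required
side_by_side <- cbind(mat, mat)  # same number of rows required
```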
List
A list is similar to a vector but can be heterogeneous, i.e., it can contain different data types.
> list[[i]] refers to the ith element of the list. It could be a vector, a matrix or a single numeric
Vector or list elements can have names too.
> names(v1) <- c("first", "second")
> names(list) <- c("first", "second", "third")
assigns names to the elements of the vector v1 or of the list; a particular element can then be retrieved by its name too.
> v1["name"] fetches the named element of a vector
> list[["name"]]
> list$name
fetch the named element of a list (notice the double brackets for a list)
Using $ on a vector returns an error: $ operator is invalid for atomic vectors
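The difference between single brackets, double brackets and $ is easiest to see on a small example. The names items and scores below are assumptions for illustration; a name other than list is used so that the built-in list() function is not masked.

```r
# A list holding a vector, a matrix and a single number
items <- list(c(1, 2, 3), matrix(1:4, nrow = 2), 42)
names(items) <- c("first", "second", "third")

items[[1]]          # the vector itself
items[["third"]]    # the single number, fetched by name
items$second        # the matrix, fetched with $

class(items[1])     # single brackets return a (sub)list
class(items[[1]])   # double brackets return the element

# A named atomic vector
scores <- c(first = 10, second = 20)
scores["first"]     # works: named element of a vector

# scores$first fails, because $ is not valid for atomic vectors
outcome <- try(scores$first, silent = TRUE)
class(outcome)      # "try-error"
```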
Data Frames
The data frame is perhaps the most important data type for data mining purposes. It stores data in a tabular or spreadsheet fashion. It can have several named columns containing fields, while rows represent records or data points.
df is the name of the data frame
> colnames(df) is used to name the columns
> rownames(df) <- NULL removes the row names, which is sometimes desirable
> df[["name"]]
> df$name
a column of df can be retrieved by either of these commands
> df[i, ] retrieves the ith row
> df[, j] retrieves the jth column
> df[i, j] retrieves the cell in the ith row and jth column
Data Frames
R comes loaded with several sample data sets.
One such data set is "mtcars", which has 32 car models with 11 measurements each (miles per gallon, number of cylinders etc.)
Let us work with this data set
> data <- data.frame(mtcars)
Try these on the mtcars data set
> data$mpg gives the "mpg" column from data
> data[["mpg"]] gives the "mpg" column from data
> data[2, ] gives the 2nd row from data
> data[, 4] gives the 4th column from data
> data[4, 5] gives the value in the 4th row and 5th column
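The access patterns above can be run in one go on mtcars. This is a minimal sketch that only uses the built-in data set.

```r
data <- data.frame(mtcars)

dim(data)        # 32 rows, 11 columns
colnames(data)   # the 11 measurement names
str(data)        # compact overview of every column

head(data$mpg)                       # first few mpg values
identical(data$mpg, data[["mpg"]])   # both forms return the same column

data[2, ]        # 2nd row (one car, all measurements)
data[, 4]        # 4th column, which is hp
data[4, 5]       # 4th row, 5th column (drat of the 4th car)
```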
Load Data
View the data by clicking on the object
Data Frames
Create a new data frame from data
> data1 <- data.frame(data$mpg, data$hp)
You may have to rename the columns as below
> colnames(data1) <- c("mpg", "hp")
Try some descriptive statistics on this
> mean(data1$mpg) mean of all mpg values
> sd(data1$mpg) standard deviation of all mpg values
Try some graphical statistics on this
> hist(data1$mpg) generates a histogram of mpg values
> plot(data1$hp, data1$mpg) generates a scatter plot of hp vs. mpg
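As a variation on the steps above, the columns can be named when the data frame is created, which makes the renaming step unnecessary. The sketch below also adds summary() and cor(), which are not on the slide, as further descriptive statistics.

```r
data <- data.frame(mtcars)

# Naming the columns at creation avoids the default names data.mpg and data.hp
data1 <- data.frame(mpg = data$mpg, hp = data$hp)
colnames(data1)

mean(data1$mpg)    # mean of all mpg values
sd(data1$mpg)      # standard deviation of all mpg values
summary(data1)     # min, quartiles, mean and max for each column

# Correlation between hp and mpg; a negative value means that
# higher horsepower goes with lower miles per gallon in this data.
cor(data1$hp, data1$mpg)
```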
Clean up RStudio before exiting, for others to use
Ctrl + L (lower case) clears the console
Clear the objects from the workspace
Clear the plots from the plot window
R Packages
R is the collective work of many researchers and experts in data mining.
They contribute libraries, or sets of functions, called packages.
R works through these packages for specific tasks, and there is an endless list of them.
They provide multiple ways to achieve the same result in R, which gives enormous power but can be confusing at times.
One should know the right package, install it and call it as follows
> install.packages("package_name")
> library("package_name")
All the packages required for this tutorial have already been installed; you cannot install packages on your own here.
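Where installing is not permitted, it can help to check whether a package is available before calling it. The following is a minimal sketch; the variable names pkg and available are assumptions, and sqldf is used only because it is the package named later in this tutorial.

```r
pkg <- "sqldf"

# requireNamespace() returns TRUE or FALSE instead of stopping with an error
available <- requireNamespace(pkg, quietly = TRUE)

if (available) {
  # character.only = TRUE lets library() take the name from a variable
  library(pkg, character.only = TRUE)
} else {
  message("Package '", pkg, "' is not installed. ",
          "Where permitted, install.packages(\"", pkg, "\") would add it.")
}
```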
Subsetting Data Frames with sqldf
Let us work with one of these packages, called sqldf, to perform a very important task on a data frame: subsetting, i.e., picking part of the data by certain conditions.
> library("sqldf")
already installed, so there is no need to call the install command
Try out the following and view the results to understand them
With sqldf, query data for the rows where mpg > 20.0, then for the rows where mpg > 20.0 and disp > 200, assigning each result to dt
> temp <- data[c("mpg", "cyl")]
> temp <- data[c(1, 2)]
the same result is returned
> temp <- data[which(data$mpg == 21.0 & data$hp == 110), ]
> temp <- data[which(data$mpg > 20.0 & data$disp > 200), ]
a column must be referenced as data$mpg, not as the string "mpg", and the trailing comma selects rows rather than columns
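The two conditions can be sketched in base R and, if the package is available, in sqldf. The exact SQL text on the original slide is not recoverable, so the "select * from data" form below is an assumption, as are the names high_mpg, high_mpg_big and dt_sql.

```r
data <- data.frame(mtcars)

# Base R: rows where mpg exceeds 20
high_mpg <- data[data$mpg > 20.0, ]

# Base R: subset() lets column names be used directly in the condition
high_mpg_big <- subset(data, mpg > 20.0 & disp > 200)

nrow(high_mpg)
rownames(high_mpg_big)

# sqldf version, run only if the package is installed.
# The query text is an assumed reconstruction, not the slide's original.
if (requireNamespace("sqldf", quietly = TRUE)) {
  dt_sql <- sqldf::sqldf(
    "select * from data where mpg > 20.0 and disp > 200"
  )
  print(dt_sql)
}
```

Note that sqldf returns its result without the car names as row names, whereas the base R forms keep them.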
Data Mining with R: Multiple Regression Model
Let us build a multiple regression model on the cars data.
The regression functions are part of base R, so no package needs to be called.
Let us attempt it step by step
> result <- lm(Y ~ x1 + x2 + x3 + …., data)
Y is the response, the x's are the predictors and data is the data frame to be used
> result <- lm(Y ~ ., data)
the dot (.) means including all the variables except Y in model building
Let us review the result
> summary(result)
result is actually a "list" which has several components; summary displays the main ones, and we can look at the components separately too
> result$coefficients
> result$residuals
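The dot notation can be tried directly on mtcars. This is a minimal sketch in which mpg is taken as the response, as in the slides that follow; the name result_all is an assumption.

```r
data <- data.frame(mtcars)

# mpg is the response; the dot pulls in all other columns as predictors
result_all <- lm(mpg ~ ., data = data)

summary(result_all)             # coefficients, significance, R-squared, F-statistic
names(result_all)               # the components stored in the fitted object

result_all$coefficients         # intercept plus one coefficient per predictor
head(result_all$residuals)      # one residual per car
length(result_all$residuals)
```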
Multiple Linear Regression
Let us do some data discovery before attempting regression modelling
> plot(data$mpg, data$cyl)
> plot(data$mpg, data$disp)
> plot(data$mpg, data$hp)
> plot(data$mpg, data$drat)
> plot(data$mpg, data$wt)
> plot(data$mpg, data$qsec)
Based on correlation, we use only 4 of them as below
> result <- lm(mpg ~ disp + hp + drat + wt, data)
Use result$coefficients or result$residuals to view them in detail; they can be manipulated separately
> Residual <- data.frame(result$residuals)
and you have the residuals for each car
Output of summary(result):
Call: lm(formula = mpg ~ disp + hp + drat + wt, data = data)
Coefficients: (Intercept) is marked ***, hp and wt are marked **
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error on 27 degrees of freedom
F-statistic on 4 and 27 DF, p-value: 2.704e-10
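The slide does not say how the correlations were assessed, so the sketch below is one possible way, assuming plain Pearson correlations with mpg. The names cors and residual_df are assumptions.

```r
data <- data.frame(mtcars)

# Correlation of every column with mpg (Pearson, the cor() default)
cors <- cor(data)["mpg", ]

# Strength of association with mpg, strongest first, leaving out mpg itself
sort(abs(cors[names(cors) != "mpg"]), decreasing = TRUE)

# The four-predictor model from the slide
result <- lm(mpg ~ disp + hp + drat + wt, data = data)
summary(result)

# Residuals as a data frame, one row per car
residual_df <- data.frame(residual = result$residuals)
head(residual_df)
```

With 32 cars and five estimated coefficients (intercept plus four predictors), 27 residual degrees of freedom remain, which matches the output quoted above.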
Multiple Linear Regression the other way
Now let us try
> result <- glm(mpg ~ disp + hp + drat + wt, family = "gaussian", data)
Any difference in the results?
The coefficient estimates are the same as with lm, except that the summary reports deviance and AIC in place of R-squared and the F-statistic.
We have used glm (Generalized Linear Model), of which lm (Linear Model, i.e., regression) is a special case; note family = "gaussian".
In the same way, we can use "binomial" to model a binary logistic regression.
This emphasizes the point that there are various ways to achieve the same objective in R; we have to weigh the options.
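The comparison can be checked in code. The sketch below fits both models and then, as an illustration of the binomial family, models the 0/1 column am from wt; that choice of response and predictor is an assumption, not something taken from the slides.

```r
data <- data.frame(mtcars)

fit_lm  <- lm(mpg ~ disp + hp + drat + wt, data = data)
fit_glm <- glm(mpg ~ disp + hp + drat + wt, family = "gaussian", data = data)

# The coefficient estimates agree between the two fits
all.equal(coef(fit_lm), coef(fit_glm))

summary(fit_glm)   # reports deviance and AIC
AIC(fit_glm)

# Binary logistic regression with the binomial family.
# am (transmission type, coded 0/1) ~ wt is an illustrative assumption.
fit_logit <- glm(am ~ wt, family = "binomial", data = data)
summary(fit_logit)
```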