

Introduction
• R is an extremely popular tool for statistical analysis and data mining.
• It is free and open source, and can be installed with ease on all major platforms.
• It is the result of the collective work of many researchers and experts in data mining.
• Organizations such as SAP, Oracle and Tableau allow R to be integrated with their applications.
• Visit https://cran.r-project.org/ for software downloads, packages, documentation and the latest on R.
• Google is awash with valuable information, blogs and tips on R; answers to most questions are a search away.

Getting Started: R Data Types (Simple)
• Numeric
• Integer
• Complex
• Logical
• Character
Try the following in the R console at the command prompt (>). "<-" is R's assignment operator; "=" can be used interchangeably for ordinary top-level assignments.
> X <- 10
> class(X)
Repeat the same with a decimal number, a complex number (written with a trailing i) and a character string such as the one below, and check class(X) each time.
> X <- "a"
> class(X)
Even a whole number such as 10 is considered "numeric" by default. With X holding a number, it can be changed to "integer" as follows:
> X <- as.integer(X)
In the same way, as.character(X) converts a variable to character. is.integer(X) or is.character(X) checks whether a variable is an integer or a character.
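The class checks described on this slide can be sketched as follows. The variable names x and x_int are illustrative choices, not taken from the slides.

```r
# Check the class of a few simple values.
x <- 10
class(x)                 # "numeric": whole numbers are stored as doubles by default

x_int <- as.integer(x)   # explicit conversion to integer
class(x_int)             # "integer"

class(10L)               # "integer": the L suffix writes an integer literal
class(2 + 3i)            # "complex"
class(TRUE)              # "logical"
class("a")               # "character"

is.integer(x)            # FALSE, because x is still a double
is.character(as.character(x))  # TRUE after conversion
```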

Getting Started: R Data Types (Complex): Vector
• A vector is a sequence of data elements of the same basic type.
• Vectors can hold integers, numerics, characters and so on.
• All elements of a vector must be of the same data type; if they are not, R coerces them to a common type.
Try the following in the R console at the command prompt. R is case-sensitive, so v1 and V1 are different names.
> v1 <- c(1,2,3)
> v2 <- c(4,5,6)      # creates two vectors, v1 and v2
> add <- v1 + v2
> sub <- v2 - v1      # creates two vectors, add and sub, by arithmetic operations on them
• Vectors in these operations should be of the same length. If they are not, R recycles the shorter vector, and issues a warning when the longer length is not a multiple of the shorter one.
• length(v) gives the length of the vector.
• v[i] retrieves the ith member of the vector.
• v[-i] retrieves all but the ith member.
• v[i:j] retrieves the ith to jth members.
• class(v1) gives the type of the vector's elements ("numeric"), not "vector".
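A short sketch of the vector behaviour described above: arithmetic, coercion, recycling and indexing. The names mixed and recycled are illustrative.

```r
v1 <- c(1, 2, 3)
v2 <- c(4, 5, 6)

add <- v1 + v2            # element-wise addition
sub <- v2 - v1            # element-wise subtraction

# Mixed inputs are coerced to one common type (here: character).
mixed <- c(1, "a", TRUE)
class(mixed)

# The shorter vector is recycled. Length 4 is a multiple of 2, so no warning;
# c(1, 2, 3) + c(10, 20) would also run, but with a warning.
recycled <- c(1, 2, 3, 4) + c(10, 20)

length(v1)   # number of elements
v1[2]        # second element
v1[-2]       # everything except the second element
v1[2:3]      # second to third elements
```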

Matrix
A matrix is similar to a vector but arranged in a 2D format with rows and columns. Run the following sequence of commands to build a matrix.
> mat <- c(1,2,3,4,5,6)
> mat <- matrix(mat, nrow = 2, ncol = 3)
> dim(mat)           # returns nrow and ncol of the matrix
> mat[n, ]           # returns the nth row
> mat[, n]           # returns the nth column
> mat[, n:m]         # returns columns n to m
> mat[, c(n,m)]      # returns columns n and m
> t(mat)             # returns the transpose of the matrix
> solve(A)           # inverse of A; A has to be a square, invertible matrix
> A %*% B            # matrix multiplication of A and B
> rbind(A, B)        # merges two matrices by rows
> cbind(A, B)        # merges two matrices by columns
Create two matrices and try these in the R console. rbind needs matrices with the same number of columns; cbind needs matrices with the same number of rows.
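The matrix commands above can be tried on concrete values. This is a minimal sketch; the 2 x 2 matrix A is a made-up example chosen to be invertible.

```r
# matrix() fills column by column by default.
mat <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
dim(mat)        # rows, then columns
mat[1, ]        # first row
mat[, 2:3]      # second and third columns
t(mat)          # transpose: 3 rows, 2 columns

# solve() needs a square, invertible matrix (A is a made-up example).
A <- matrix(c(2, 1, 1, 3), nrow = 2)
A_inv <- solve(A)
identity_check <- A %*% A_inv   # should be the 2 x 2 identity, up to rounding

stacked <- rbind(mat, mat)      # same number of columns required
side_by_side <- cbind(mat, mat) # same number of rows required
```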

List
A list is similar to a vector but can be heterogeneous, i.e. it can contain elements of different data types.
> list[[i]]          # the ith element of the list; it could be a vector, a matrix or a single number
Vector or list elements can have names too.
> names(V1) <- c("first", "second")
> names(list) <- c("first", "second", "third")
These assign names to the elements of the vector V1 or of the list; a particular element can then be retrieved by its name too.
> V1["name"]         # fetches the named element of a vector
> list[["name"]]
> list$name          # either form fetches the named element of a list (notice the double brackets)
Using $ on a vector returns an error: "$ operator is invalid for atomic vectors".
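A small sketch of list access by position and by name. The list is called lst here, an illustrative name chosen to avoid confusion with R's list() function, and its contents are made up.

```r
lst <- list(first = c(1, 2, 3),
            second = matrix(1:4, nrow = 2),
            third = "text")

lst[[1]]           # first element: a numeric vector
lst[["second"]]    # element by name: a matrix
lst$third          # same idea with the $ shorthand
sub_list <- lst[1] # single brackets keep the result as a list

# Named atomic vector: single brackets work, $ does not.
v <- c(first = 10, second = 20)
v["first"]
dollar_on_vector <- tryCatch(v$first,
                             error = function(e) conditionMessage(e))
dollar_on_vector   # the error message raised by $ on an atomic vector
```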

Data Frames
The data frame is perhaps the most important data type for data mining purposes. It stores data in a tabular, spreadsheet-like fashion: the named columns hold the fields, and the rows represent records or data points. Below, df is the name of the data frame.
> colnames(df)            # gets the column names; assign to it to name the columns
> rownames(df) <- NULL    # removes the row names, desirable sometimes
> df[["name"]]
> df$name                 # either command retrieves a column of df
> df[i, ]                 # retrieves the ith row
> df[, j]                 # retrieves the jth column
> df[i, j]                # retrieves the cell in the ith row and jth column
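To make the indexing forms concrete, here is a sketch on a tiny hand-built data frame. The column names and values are hypothetical and exist only for illustration.

```r
# A made-up three-row data frame.
df <- data.frame(model = c("A", "B", "C"),
                 mpg = c(21.0, 18.5, 30.2),
                 cyl = c(6, 8, 4))

colnames(df)                    # current column names
colnames(df)[3] <- "cylinders"  # rename the third column

df[["mpg"]]   # a column as a vector
df$mpg        # the same column
df[2, ]       # second row (still a data frame)
df[, 2]       # second column
df[2, 2]      # single cell: row 2, column 2

rownames(df) <- NULL            # reset row names to 1, 2, 3
```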

Data Frames
R comes loaded with several sample data sets. One such data set is "mtcars", which has 32 car models with 11 measurements (miles per gallon, number of cylinders etc.). Let us work with this data set.
> data <- data.frame(mtcars)
Try these on the mtcars data set.
> data$mpg          # gives the "mpg" column from data
> data[["mpg"]]     # gives the "mpg" column from data
> data[2, ]         # gives the 2nd row from data
> data[, 4]         # gives the 4th column from data
> data[4, 5]        # gives the value in the 4th row and 5th column
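The same commands, with a few checks on what they return for the built-in mtcars data set:

```r
data <- data.frame(mtcars)

dim(data)             # 32 rows (cars) and 11 columns (measurements)
head(data$mpg)        # first few miles-per-gallon values
rownames(data)[2]     # the car stored in the 2nd row
data[2, ]             # all measurements for that car
colnames(data)[4]     # name of the 4th column
data[4, 5]            # value in the 4th row and 5th column
str(data)             # compact overview of all columns
```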

Load Data
View the data by clicking on the object.

Data Frames
Create a new data frame from data.
> data1 <- data.frame(data$mpg, data$hp)
You may have to rename the columns as below.
> colnames(data1) <- c("mpg", "hp")
Try some descriptive statistics on this.
> mean(data1$mpg)              # mean of all mpg values
> sd(data1$mpg)                # standard deviation of all mpg values
Try some graphical statistics on this.
> hist(data1$mpg)              # generates a histogram of the mpg values
> plot(data1$hp, data1$mpg)    # generates a scatter plot of mpg against hp
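A sketch of the descriptive statistics step that avoids opening a plot window. Naming the columns inside data.frame() is one way to skip the later rename, and hist(..., plot = FALSE) returns the bin counts without drawing; both are choices made for this example, not something the slides prescribe.

```r
# Name the columns directly so no rename is needed afterwards.
data1 <- data.frame(mpg = mtcars$mpg, hp = mtcars$hp)

mean_mpg <- mean(data1$mpg)   # average miles per gallon
sd_mpg <- sd(data1$mpg)       # sample standard deviation
summary(data1$mpg)            # min, quartiles, mean, max

# Compute the histogram bins without plotting them.
h <- hist(data1$mpg, plot = FALSE)
h$breaks   # bin boundaries
h$counts   # number of cars in each bin
```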

Clean up RStudio before exiting, for others to use:
• Ctrl + L (lower case): to clear the console
• To clear objects from the workspace
• To clear plots from the window

R Packages
• R is the collective work of many researchers and experts in data mining.
• They contribute libraries, i.e. sets of functions, called packages.
• R relies on these packages for specific tasks, and the list of them is very long.
• They provide multiple ways to achieve the same result in R, which gives "enormous power" but can be "confusing" at times.
• One should know the right package, install it and call it as follows:
> install.packages("package_name")
> library("package_name")
All the packages required for this tutorial have already been installed; you cannot install packages yourself.
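Since installing may not be permitted, one option is to check whether a package is available before attaching it. The helper below is a hypothetical sketch (use_package is not a base R function); it relies only on requireNamespace() and library(), and it reports a missing package instead of installing it.

```r
# Hypothetical helper: attach a package if it is installed, otherwise report it.
use_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    message("Package '", pkg, "' is not installed; ",
            "it would need install.packages(\"", pkg, "\")")
    return(FALSE)
  }
  library(pkg, character.only = TRUE)  # attach by name held in a variable
  TRUE
}

ok <- use_package("stats")   # "stats" ships with R, so this should succeed
```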

Subsetting Data Frames
Let us work with one of these packages, called sqldf, to perform a very important task on a data frame: subsetting, i.e. picking part of the data by certain conditions.
> library("sqldf")      # already installed, no need to call the install command
Try out the following and view the results to understand them (… marks text missing from this transcript).
> dt <- sqldf("… 20.0")
> dt <- sqldf("… 20.0 and disp > 200")
> temp <- data[c("mpg", "cyl")]
> temp <- data[c(1, 2)]      # the same result is returned
> temp <- data[which(data$mpg == 21.0 & data$hp == 110), ]      # compare the columns themselves, not the quoted names
> temp <- … 20.0 & data$disp > 200)
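Because the sqldf queries did not survive in the transcript, the sketch below shows the same kind of subsetting with base R only. The conditions (mpg above 20, disp above 200) are inferred from the surviving fragments and should be read as assumptions; the commented sqldf line is likewise a guess at an equivalent query.

```r
data <- data.frame(mtcars)

# Rows where mpg exceeds 20 (logical indexing on rows).
high_mpg <- data[data$mpg > 20, ]

# Rows meeting two conditions; subset() lets column names be used directly.
high_mpg_big <- subset(data, mpg > 20 & disp > 200)

# Column subsets by name and by position give the same result.
two_cols <- data[c("mpg", "cyl")]
same_cols <- data[c(1, 2)]

# Exact matches on two columns; note the trailing comma to select rows.
exact <- data[which(data$mpg == 21.0 & data$hp == 110), ]

# With sqldf attached, a comparable query might look like this (assumption):
# dt <- sqldf("select * from data where mpg > 20.0")
```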

Data Mining with R: Multiple Regression Model
Let us build a multiple regression model on the cars data. The functions for regression modelling are part of base R, so no package needs to be loaded. Let us attempt it step by step.
> result <- lm(Y ~ x1 + x2 + x3 + …, data)
Y is the response, x1, x2, x3, … are the predictors and data is the data frame to be used.
> result <- lm(Y ~ ., data)
The dot (.) means including all the variables except Y in model building.
Let us review the result.
> summary(result)
result is actually a "list" which has several components. summary() displays the main ones; we can look at the components separately too.
> result$coefficients
> result$residuals
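A minimal sketch of the formula interface on mtcars. The two-predictor model (wt and hp) is an arbitrary choice for illustration, not the model used in the slides.

```r
data <- data.frame(mtcars)

# All other columns as predictors: the dot expands to every variable except mpg.
result_all <- lm(mpg ~ ., data = data)

# An explicitly listed pair of predictors (arbitrary illustrative choice).
result_two <- lm(mpg ~ wt + hp, data = data)

names(result_two)            # components stored in the fitted model object
result_two$coefficients      # intercept and one slope per predictor
head(result_two$residuals)   # observed minus fitted mpg, one per car
summary(result_two)          # coefficient table, R-squared, F-statistic
```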

Multiple Linear Regression
Let us do some data discovery before attempting regression modelling.
> plot(data$mpg, data$cyl)
> plot(data$mpg, data$disp)
> plot(data$mpg, data$hp)
> plot(data$mpg, data$drat)
> plot(data$mpg, data$wt)
> plot(data$mpg, data$qsec)
• Based on correlation, we use only 4 of them, as below.
> result <- lm(mpg ~ disp + hp + drat + wt, data)
• Use result$coefficients or result$residuals to view them in detail.
• They can be manipulated separately.
> Residual <- data.frame(result$residuals)
• And you have the residual for each car.
Summary output (… marks numeric values missing from this transcript):
Call:
lm(formula = mpg ~ disp + hp + drat + wt, data = data)
Residuals:
  Min  1Q  Median  3Q  Max
  …
Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  …                              …e-05  ***
disp         …
hp           …                                     **
drat         …
wt           …                                     **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: … on 27 degrees of freedom
Multiple R-squared: …, Adjusted R-squared: …
F-statistic: … on 4 and 27 DF, p-value: 2.704e-10
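The slide selects predictors "based on correlation" without showing the step. One way to inspect those correlations numerically, alongside the scatter plots, is sketched below; ranking by absolute correlation with mpg is an assumption about what was meant, not the author's stated procedure.

```r
data <- data.frame(mtcars)

# Correlation of every column with mpg, strongest first (by absolute value).
cor_with_mpg <- cor(data)[, "mpg"]
sort(abs(cor_with_mpg), decreasing = TRUE)

# The four-predictor model from the slide.
result <- lm(mpg ~ disp + hp + drat + wt, data = data)

# Residuals as a data frame: one row per car, labelled by car name.
residual <- data.frame(residual = result$residuals)
head(residual)

df.residual(result)   # residual degrees of freedom: 32 cars minus 5 coefficients
```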

Multiple Linear Regression the other way
Now let us try:
> result <- glm(mpg ~ disp + hp + drat + wt, family = "gaussian", data)
• Any difference in the results?
• The coefficient estimates are the same as with lm; the glm summary reports deviances and the AIC in place of the residual standard error, R-squared and F-statistic.
• We have used glm (Generalized Linear Model), of which lm (Linear Model, i.e. regression) is a special case; note family = "gaussian".
• In the same way, we can use family = "binomial" to model a binary logistic regression.
• This emphasizes the point that there are various ways to achieve the same objective in R; we have to weigh the options.
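The claim that lm and a gaussian glm agree can be checked directly. This sketch fits both and compares the coefficients and the AIC; the object names fit_lm and fit_glm are illustrative.

```r
data <- data.frame(mtcars)

fit_lm <- lm(mpg ~ disp + hp + drat + wt, data = data)
fit_glm <- glm(mpg ~ disp + hp + drat + wt, family = gaussian, data = data)

# Same estimates from both fitting routes (up to numerical tolerance).
same_coef <- all.equal(coef(fit_lm), coef(fit_glm))
same_coef

# AIC is printed by summary(fit_glm); AIC() computes it for either object.
AIC(fit_lm)
AIC(fit_glm)

summary(fit_glm)   # shows deviances and AIC rather than R-squared and F
```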