Data Analytics – ITWS-4600/ITWS-6600


Data Analytics – ITWS-4600/ITWS-6600. Lab: regression, kNN and K-means results, interpreting and evaluating models. Peter Fox. Week 4b, February 19, 2016.

Classification (2) Retrieve the abalone.csv dataset. The goal is predicting the age of abalone from physical measurements. The age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings under a microscope: a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Perform kNN classification to get predictors for Age (Rings). Interpretation not required.

What did you get? See pdf – linked off course website

Clustering (3) The iris dataset (in R, use data("iris") to load it). The 5th column is the species; the task is to find how many clusters there are without using that information. Create a new data frame and remove the fifth column. Apply kmeans (you choose k) with 1000 iterations. Use table(iris[,5], <your clustering>) to assess the results.
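As a minimal sketch of those steps (choosing k = 3 is an assumption here, since iris has three species):

```r
# Sketch of the clustering exercise above; k = 3 is an assumed choice
data(iris)
iris2 <- iris[, -5]                       # new data frame without the Species column
set.seed(42)                              # kmeans starts from random centers
km <- kmeans(iris2, centers = 3, iter.max = 1000)
table(iris[, 5], km$cluster)              # species vs. assigned cluster
```

With k = 3, setosa typically separates cleanly while versicolor and virginica overlap in a couple of cells of the table.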

Return object
cluster – a vector of integers (from 1:k) indicating the cluster to which each point is allocated.
centers – a matrix of cluster centres.
totss – the total sum of squares.
withinss – vector of within-cluster sums of squares, one component per cluster.
tot.withinss – total within-cluster sum of squares, i.e., sum(withinss).
betweenss – the between-cluster sum of squares, i.e., totss - tot.withinss.
size – the number of points in each cluster.

Contingency tables See pdf file – linked off course website

Contingency tables
A contingency table displays the (multivariate) frequency distribution of the variables. Tests for significance (not now).

> table(nyt1$Impressions, nyt1$Gender)
         0    1
   1    69   85
   2   389  395
   3   975  937
   4  1496 1572
   5  1897 2012
   6  1822 1927
   7  1525 1696
   8  1142 1203
   9   722  711
   10  366  400
   11  214  200
   12   86  101
   13   41   43
   14   10    9
   15    5    7
   16    0    4
   17    0    1

> table(nyt1$Clicks, nyt1$Gender)
        0     1
   1 10335 10846
   2   415   440
   3     9    17

Regression Exercises Using the EPI dataset, find the single most important factor in increasing the EPI in a given region. Examine distributions down to the leaf nodes and build up an EPI “model”.

Linear and least-squares
> EPI_data <- read.csv("EPI_data.csv")
> attach(EPI_data)
> boxplot(ENVHEALTH, DALY, AIR_H, WATER_H)
> lmENVH <- lm(ENVHEALTH ~ DALY + AIR_H + WATER_H)
> lmENVH … (what should you get?)
> summary(lmENVH) …
> cENVH <- coef(lmENVH)

Linear and least-squares
> lmENVH <- lm(ENVHEALTH ~ DALY + AIR_H + WATER_H)
> lmENVH

Call:
lm(formula = ENVHEALTH ~ DALY + AIR_H + WATER_H)

Coefficients:
(Intercept)         DALY        AIR_H      WATER_H
 -2.673e-05    5.000e-01    2.500e-01    2.500e-01

> summary(lmENVH) …
> cENVH <- coef(lmENVH)

Read the documentation!

Linear and least-squares
> summary(lmENVH)

Call:
lm(formula = ENVHEALTH ~ DALY + AIR_H + WATER_H)

Residuals:
       Min         1Q     Median         3Q        Max
-0.0072734 -0.0027299  0.0001145  0.0021423  0.0055205

Coefficients:
              Estimate Std. Error   t value Pr(>|t|)
(Intercept) -2.673e-05  6.377e-04    -0.042    0.967
DALY         5.000e-01  1.922e-05 26020.669   <2e-16 ***
AIR_H        2.500e-01  1.273e-05 19645.297   <2e-16 ***
WATER_H      2.500e-01  1.751e-05 14279.903   <2e-16 ***
---
p < 0.01: very strong presumption against the null hypothesis for this fit
0.01 < p < 0.05: strong presumption against the null hypothesis
0.05 < p < 0.1: low presumption against the null hypothesis
p > 0.1: no presumption against the null hypothesis

Linear and least-squares, continued:
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.003097 on 178 degrees of freedom
  (49 observations deleted due to missingness)
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 3.983e+09 on 3 and 178 DF, p-value: < 2.2e-16

> names(lmENVH)
 [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"
 [7] "qr"            "df.residual"   "na.action"     "xlevels"       "call"          "terms"
[13] "model"

Object of class lm: An object of class "lm" is a list containing at least the following components:
coefficients – a named vector of coefficients.
residuals – the residuals, that is, response minus fitted values.
fitted.values – the fitted mean values.
rank – the numeric rank of the fitted linear model.
weights – (only for weighted fits) the specified weights.
df.residual – the residual degrees of freedom.
call – the matched call.
terms – the terms object used.
contrasts – (only where relevant) the contrasts used.
xlevels – (only where relevant) a record of the levels of the factors used in fitting.
offset – the offset used (missing if none were used).
y – if requested, the response used.
x – if requested, the model matrix used.
model – if requested (the default), the model frame used.
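A small self-contained illustration of accessing these components (toy made-up data, not the EPI fit):

```r
# Toy fit to show how the "lm" list components are accessed
set.seed(1)
x <- 1:20
y <- 2 * x + rnorm(20)                  # made-up linear data with noise
fit <- lm(y ~ x)
coef(fit)                               # named coefficient vector
head(resid(fit))                        # response minus fitted values
head(fit$fitted.values)                 # fitted mean values
fit$df.residual                         # residual degrees of freedom: 20 - 2 = 18
```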

Plot original versus fitted
> plot(ENVHEALTH, col="red")
> points(lmENVH$fitted.values, col="blue")
Huh?

Try again!
> plot(ENVHEALTH[!is.na(ENVHEALTH)], col="red")
> points(lmENVH$fitted.values, col="blue")

Predict
> cENVH <- coef(lmENVH)
> DALYNEW <- c(seq(5,95,5)) #2
> AIR_HNEW <- c(seq(5,95,5)) #3
> WATER_HNEW <- c(seq(5,95,5)) #4

Predict
> NEW <- data.frame(DALYNEW, AIR_HNEW, WATER_HNEW)
> pENV <- predict(lmENVH, NEW, interval="prediction")
> cENV <- predict(lmENVH, NEW, interval="confidence") # look up what this does

Predict object returns: predict.lm produces a vector of predictions, or a matrix of predictions and bounds with column names fit, lwr, and upr if interval is set. Access via [,1] etc. If se.fit is TRUE, a list with the following components is returned:
fit – vector or matrix as above.
se.fit – standard error of predicted means.
residual.scale – residual standard deviations.
df – degrees of freedom for residual.
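To see the fit/lwr/upr layout on something runnable, here is a toy version of the predict() call (toy data, not the EPI model):

```r
# Sketch: predict() with interval = "prediction" on a toy fit
set.seed(1)
x <- 1:20
y <- 2 * x + rnorm(20)                  # made-up data
fit <- lm(y ~ x)
new <- data.frame(x = c(5, 10, 15))     # new values of the predictor
p <- predict(fit, new, interval = "prediction")
p                                       # matrix with columns fit, lwr, upr
p[, "fit"]                              # access a column by name (or p[, 1])
```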

Output from predict
> head(pENV)
        fit      lwr      upr
1        NA       NA       NA
2  11.55213 11.54591 11.55834
3  18.29168 18.28546 18.29791
4        NA       NA       NA
5  69.92533 69.91915 69.93151
6  90.20589 90.19974 90.21204
…

> tail(pENV)
     fit lwr upr
226   NA  NA  NA
227   NA  NA  NA
228 34.95256 34

Read the documentation!

Classification Exercises (Lab3b_knn1_2016.R)
> library(class) # knn() lives in the class package
> nyt1 <- read.csv("nyt1.csv")
> nyt1 <- nyt1[which(nyt1$Impressions>0 & nyt1$Clicks>0 & nyt1$Age>0),]
> nnyt1 <- dim(nyt1)[1] # shrink it down!
> sampling.rate = 0.9
> num.test.set.labels = nnyt1*(1. - sampling.rate)
> training <- sample(1:nnyt1, sampling.rate*nnyt1, replace=FALSE)
> train <- subset(nyt1[training,], select=c(Age,Impressions))
> testing <- setdiff(1:nnyt1, training)
> test <- subset(nyt1[testing,], select=c(Age,Impressions))
> cg <- nyt1$Gender[training]
> true.labels <- nyt1$Gender[testing]
> classif <- knn(train, test, cg, k=5) #
> classif
> attributes(.Last.value) # interpretation to come!

K Nearest Neighbors (classification) Script – Lab3b_knn1_2016.R
> nyt1 <- read.csv("nyt1.csv") … from week 3b slides or script
> classif <- knn(train, test, cg, k=5) #
> head(true.labels)
[1] 1 0 0 1 1 0
> head(classif)
[1] 1 1 1 1 0 0
Levels: 0 1
> ncorrect <- true.labels == classif
> table(ncorrect)["TRUE"] # or
> length(which(ncorrect))
What do you conclude?
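Since nyt1.csv is not bundled here, the same accuracy check can be sketched self-contained with the iris data standing in (the split sizes are assumptions):

```r
# Sketch of the kNN accuracy check above, using iris as a stand-in dataset
library(class)                           # knn() lives in the class package
data(iris)
set.seed(2)
idx <- sample(1:nrow(iris), 120)         # 120 training rows, 30 test rows
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]
cl <- iris$Species[idx]
true.labels <- iris$Species[-idx]
classif <- knn(train, test, cl, k = 5)
ncorrect <- true.labels == classif
length(which(ncorrect)) / length(ncorrect)   # fraction classified correctly
```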

Classification Exercises (Lab3b_knn2_2016.R) 2 examples in the script

Clustering Exercises Lab3b_kmeans1_2016.R Lab3b_kmeans2_2016.R – plotting up results from the iris clustering

Regression
> library(xlsx) # read.xlsx() lives in the xlsx package
> bronx <- read.xlsx("<x>/rollingsales_bronx.xls", pattern="BOROUGH", stringsAsFactors=FALSE, sheetIndex=1, startRow=5, header=TRUE)
> plot(log(bronx$GROSS.SQUARE.FEET), log(bronx$SALE.PRICE))
> m1 <- lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET), data=bronx)
What’s wrong?

Clean up…
> bronx <- bronx[which(bronx$GROSS.SQUARE.FEET>0 & bronx$LAND.SQUARE.FEET>0 & bronx$SALE.PRICE>0),]
or
> bronx1 <- bronx[which(bronx$GROSS.SQUARE.FEET!="0" & bronx$LAND.SQUARE.FEET!="0" & bronx$SALE.PRICE!="$0"),]
> m1 <- lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET), data=bronx)
> summary(m1)

Call:
lm(formula = log(SALE.PRICE) ~ log(GROSS.SQUARE.FEET), data = bronx)

Residuals:
     Min       1Q   Median       3Q      Max
-14.4529   0.0377   0.4160   0.6572   3.8159

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)              7.0271     0.3088   22.75   <2e-16 ***
log(GROSS.SQUARE.FEET)   0.7013     0.0379   18.50   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.95 on 2435 degrees of freedom
Multiple R-squared: 0.1233, Adjusted R-squared: 0.1229
F-statistic: 342.4 on 1 and 2435 DF, p-value: < 2.2e-16

Plot
> plot(log(bronx$GROSS.SQUARE.FEET), log(bronx$SALE.PRICE))
> abline(m1, col="red", lwd=2)
# then
> plot(resid(m1))

Another model (2)? Add two more variables to the linear model: LAND.SQUARE.FEET and NEIGHBORHOOD. Repeat, but suppress the intercept (2a).

Model 3/4
Model 3: log(SALE.PRICE) vs., with no intercept: log(GROSS.SQUARE.FEET), log(LAND.SQUARE.FEET), NEIGHBORHOOD, BUILDING.CLASS.CATEGORY
Model 4: log(SALE.PRICE) vs., with no intercept: log(GROSS.SQUARE.FEET), log(LAND.SQUARE.FEET), NEIGHBORHOOD*BUILDING.CLASS.CATEGORY
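The difference between a + (main effects only) and a * (main effects plus interaction) in these formulas can be illustrated on toy factors (made-up data, not the Bronx sales):

```r
# Toy illustration of + vs. * with factors, no intercept; data are made up
set.seed(3)
df <- data.frame(y  = rnorm(40),
                 f1 = factor(rep(c("a", "b"), 20)),
                 f2 = factor(rep(c("u", "v"), each = 20)))
m_add <- lm(y ~ 0 + f1 + f2, data = df)  # main effects only
m_int <- lm(y ~ 0 + f1 * f2, data = df)  # adds f1:f2 interaction terms
coef(m_add)                              # 3 estimated coefficients
coef(m_int)                              # 4 estimable coefficients, one per f1×f2 cell
```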

Solution model 2
> m2 <- lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD), data=bronx)
> summary(m2)
> plot(resid(m2))
#
> m2a <- lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD), data=bronx)
> summary(m2a)
> plot(resid(m2a))

Solution model 3 and 4
> m3 <- lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD)+factor(bronx$BUILDING.CLASS.CATEGORY), data=bronx)
> summary(m3)
> plot(resid(m3))
#
> m4 <- lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD)*factor(bronx$BUILDING.CLASS.CATEGORY), data=bronx)
> summary(m4)
> plot(resid(m4))

Assignment 3: Preliminary and Statistical Analysis. Due ~ March 4. 15% (written). Distribution analysis and comparison, visual ‘analysis’, statistical model fitting and testing of some of the nyt2…31 datasets. See the website for assignment details.