Peter Fox
Data Analytics, ITWS-4963/ITWS-6965
Week 5b, February 21, 2014
Interpreting regression, kNN and K-means results, evaluating models

Plot tools/tips
plot, points

Linear and least-squares
> EPI_data <- read.csv("EPI_data.csv")
> attach(EPI_data)
> boxplot(ENVHEALTH, DALY, AIR_H, WATER_H)
> lmENVH <- lm(ENVHEALTH ~ DALY + AIR_H + WATER_H)
> lmENVH
Let's recall what this taught you!
> summary(lmENVH)
> cENVH <- coef(lmENVH)

Linear and least-squares
> lmENVH <- lm(ENVHEALTH ~ DALY + AIR_H + WATER_H)
> lmENVH
Call:
lm(formula = ENVHEALTH ~ DALY + AIR_H + WATER_H)
Coefficients:
(Intercept)  DALY  AIR_H  WATER_H
…
> summary(lmENVH)
…
> cENVH <- coef(lmENVH)

Linear and least-squares
> summary(lmENVH)
Call:
lm(formula = ENVHEALTH ~ DALY + AIR_H + WATER_H)
Residuals:
Min  1Q  Median  3Q  Max
…
Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  …
DALY         5.000e…   …           …        <2e-16 ***
AIR_H        2.500e…   …           …        <2e-16 ***
WATER_H      2.500e…   …           …        <2e-16 ***
Reading the p-values (null hypothesis vs. this fit):
p < 0.01: very strong presumption against the null hypothesis
0.01 < p < 0.05: strong presumption against the null hypothesis
0.05 < p < 0.1: low presumption against the null hypothesis
p > 0.1: no presumption against the null hypothesis

Linear and least-squares, continued
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: … on 178 degrees of freedom
(49 observations deleted due to missingness)
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 3.983e+09 on 3 and 178 DF, p-value: < 2.2e-16
> names(lmENVH)
[1] "coefficients" "residuals" "effects" "rank" "fitted.values" "assign"
[7] "qr" "df.residual" "na.action" "xlevels" "call" "terms"
[13] "model"

Object of class lm
An object of class "lm" is a list containing at least the following components:
coefficients: a named vector of coefficients.
residuals: the residuals, that is response minus fitted values.
fitted.values: the fitted mean values.
rank: the numeric rank of the fitted linear model.
weights: (only for weighted fits) the specified weights.
df.residual: the residual degrees of freedom.
call: the matched call.
terms: the terms object used.
contrasts: (only where relevant) the contrasts used.
xlevels: (only where relevant) a record of the levels of the factors used in fitting.
offset: the offset used (missing if none were used).
y: if requested, the response used.
x: if requested, the model matrix used.
model: if requested (the default), the model frame used.
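
The components listed above can be pulled out of any fitted model. A minimal sketch, assuming the built-in mtcars data as a stand-in for the EPI data (the variable names mpg, wt and hp are not from the slides):

```r
# Assumption: mtcars replaces EPI_data so the example runs anywhere.
fit <- lm(mpg ~ wt + hp, data = mtcars)

comp_names <- names(fit)       # components of the "lm" list
coefs <- coef(fit)             # same values as fit$coefficients
res   <- resid(fit)            # same values as fit$residuals

# Residuals are response minus fitted values.
max_gap <- max(abs(res - (mtcars$mpg - fitted(fit))))

# summary() adds derived quantities: R-squared and the coefficient table.
s     <- summary(fit)
r2    <- s$r.squared
pvals <- s$coefficients[, "Pr(>|t|)"]
```

The extractor functions coef(), resid() and fitted() are usually preferred over reaching into the list with $, because they also work for other model classes.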

Plot original versus fitted
> plot(ENVHEALTH, col="red")
> points(lmENVH$fitted.values, col="blue")
Huh?

Try again!
> plot(ENVHEALTH[!is.na(ENVHEALTH)], col="red")
> points(lmENVH$fitted.values, col="blue")
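
Filtering on !is.na() of the response lines the two series up only when the response is the sole variable with missing values, because lm() drops every row that has an NA in any model variable. A sketch of two more general ways to align observed and fitted values, assuming the built-in airquality data in place of the EPI data:

```r
# Assumption: airquality stands in for EPI_data; it has NAs in Ozone and Solar.R.
fit <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)

n_obs    <- nrow(airquality)      # rows in the raw data
n_fitted <- length(fitted(fit))   # rows the model kept after dropping NAs

# Rows kept by lm(): complete in the response AND in every predictor.
used <- as.integer(names(fitted(fit)))
n_response_ok <- sum(!is.na(airquality$Ozone))   # can exceed n_fitted

# Alternative: na.exclude pads fitted() with NA back to the full length.
fit_pad  <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality,
               na.action = na.exclude)
n_padded <- length(fitted(fit_pad))

if (interactive()) {
  plot(airquality$Ozone[used], col = "red")   # observed, same rows as the fit
  points(fitted(fit), col = "blue")           # fitted values
}
```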

Predict
> cENVH <- coef(lmENVH)
> DALYNEW <- c(seq(5,95,5)) #2
> AIR_HNEW <- c(seq(5,95,5)) #3
> WATER_HNEW <- c(seq(5,95,5)) #4

Predict
> NEW <- data.frame(DALYNEW, AIR_HNEW, WATER_HNEW)
> pENV <- predict(lmENVH, NEW, interval="prediction")
> cENV <- predict(lmENVH, NEW, interval="confidence") # look up what this does

Predict object returns
predict.lm produces a vector of predictions or a matrix of predictions and bounds with column names fit, lwr, and upr if interval is set. Access via [,1] etc.
If se.fit is TRUE, a list with the following components is returned:
fit: vector or matrix as above.
se.fit: standard error of predicted means.
residual.scale: residual standard deviations.
df: degrees of freedom for residual.

Output from predict
> head(pENV)
  fit lwr upr
1  NA  NA  NA
…
> tail(pENV)
    fit lwr upr
226  NA  NA  NA
227  NA  NA  NA
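
Row labels such as 226 and 227 suggest that predictions were computed for the rows of the original data rather than for the 19 new values. One likely cause is that the columns of the new data frame (DALYNEW, ...) do not carry the predictor names used in the formula (DALY, ...), so predict() falls back to the attached variables. A sketch of the matching-names pattern, assuming mtcars as a stand-in dataset:

```r
# Assumption: mtcars replaces EPI_data; wt and hp play the role of the predictors.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Column names in newdata must equal the predictor names in the formula.
new_cars <- data.frame(wt = c(2.5, 3.0, 3.5), hp = c(100, 150, 200))

p_pred <- predict(fit, new_cars, interval = "prediction")  # bounds for a new observation
p_conf <- predict(fit, new_cars, interval = "confidence")  # bounds for the mean response

width_pred <- p_pred[, "upr"] - p_pred[, "lwr"]
width_conf <- p_conf[, "upr"] - p_conf[, "lwr"]
```

Both calls return the same fit column; the prediction interval is wider because it also accounts for the residual scatter of an individual observation.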

Did you repeat this for: AIR_E? CLIMATE?

K Nearest Neighbors (classification)
Scripts: Lab4b_0_2014.R
> nyt1 <- read.csv("nyt1.csv")
… from week 4b slides or script
> classif <- knn(train, test, cg, k=5)
> head(true.labels)
[1] …
> head(classif)
[1] …
Levels: 0 1
> ncorrect <- true.labels == classif
> table(ncorrect)["TRUE"] # or
> length(which(ncorrect))
What do you conclude?
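
To draw a conclusion from ncorrect it helps to turn the count into a rate and to see which classes get confused. A minimal sketch, assuming the built-in iris data and an 80/20 split in place of the nyt1 train/test sets from the lab script:

```r
library(class)   # provides knn()

# Assumption: iris replaces nyt1; the split and seed are illustrative only.
set.seed(1)
n        <- nrow(iris)
train_id <- sample.int(n, floor(0.8 * n))

train       <- iris[train_id, 1:4]
test        <- iris[-train_id, 1:4]
cg          <- iris$Species[train_id]    # training class labels
true.labels <- iris$Species[-train_id]   # held-out labels

classif  <- knn(train, test, cg, k = 5)
ncorrect <- true.labels == classif

accuracy  <- mean(ncorrect)                                   # fraction correct
confusion <- table(predicted = classif, actual = true.labels) # per-class breakdown
```

An accuracy figure only means something relative to a baseline, for example the share of the most common class in true.labels.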

Contingency tables
> table(nyt1$Impressions, nyt1$Gender)
# Contingency table: displays the (multivariate) frequency distribution of the variables. Tests for significance (not now)
> table(nyt1$Clicks, nyt1$Gender)
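
Raw counts are easier to read once they are turned into margins and proportions. A small sketch, assuming mtcars (cylinders by gears) in place of the nyt1 columns:

```r
# Assumption: mtcars stands in for nyt1.
tab <- table(cyl = mtcars$cyl, gear = mtcars$gear)

with_margins <- addmargins(tab)              # adds row and column totals
row_share    <- prop.table(tab, margin = 1)  # each row sums to 1
col_share    <- prop.table(tab, margin = 2)  # each column sums to 1
```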

Regression
> plot(log(bronx$GROSS.SQUARE.FEET), log(bronx$SALE.PRICE))
> m1 <- lm(log(SALE.PRICE) ~ log(GROSS.SQUARE.FEET), data=bronx)
You were reminded that log(0) is … not fun
THINK through what you are doing…
Filtering is somewhat inevitable:
> bronx <- bronx[which(bronx$GROSS.SQUARE.FEET>0 & bronx$LAND.SQUARE.FEET>0 & bronx$SALE.PRICE>0),]
> m1 <- lm(log(SALE.PRICE) ~ log(GROSS.SQUARE.FEET), data=bronx)

Interpreting this!
Call:
lm(formula = log(SALE.PRICE) ~ log(GROSS.SQUARE.FEET), data = bronx)
Residuals:
Min  1Q  Median  3Q  Max
…
Coefficients:
                        Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)             …         …           …        <2e-16 ***
log(GROSS.SQUARE.FEET)  …         …           …        <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.95 on 2435 degrees of freedom
Multiple R-squared: …, Adjusted R-squared: …
F-statistic: … on 1 and 2435 DF, p-value: < 2.2e-16
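
In a log-log model the slope reads as an elasticity: a 1% change in the predictor goes with roughly a slope-% change in the response. A sketch on simulated data (the true slope of 0.8, the noise level and the variable names sqft and price are all assumptions for illustration, not values from the Bronx fit):

```r
# Assumption: simulated data with a known slope, so the interpretation can be checked.
set.seed(42)
sqft  <- runif(500, 500, 5000)
price <- exp(4 + 0.8 * log(sqft) + rnorm(500, sd = 0.3))
sim   <- data.frame(sqft, price)

m     <- lm(log(price) ~ log(sqft), data = sim)
slope <- coef(m)[["log(sqft)"]]

# Doubling the floor area multiplies the predicted price by 2^slope.
p1000 <- predict(m, data.frame(sqft = 1000))
p2000 <- predict(m, data.frame(sqft = 2000))
ratio <- exp(p2000 - p1000)
```

The residual standard error is on the log scale as well, so a value of 1.95 corresponds to a multiplicative scatter of about exp(1.95) in price, which is large.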

Plots – tell me what they tell you!

Solution model 2
> m2 <- lm(log(SALE.PRICE) ~ log(GROSS.SQUARE.FEET) + log(LAND.SQUARE.FEET) + factor(NEIGHBORHOOD), data=bronx)
> summary(m2)
> plot(resid(m2))
#
> m2a <- lm(log(SALE.PRICE) ~ 0 + log(GROSS.SQUARE.FEET) + log(LAND.SQUARE.FEET) + factor(NEIGHBORHOOD), data=bronx)
> summary(m2a)
> plot(resid(m2a))

How do you interpret this residual plot?

Solution model 3 and 4
> m3 <- lm(log(SALE.PRICE) ~ 0 + log(GROSS.SQUARE.FEET) + log(LAND.SQUARE.FEET) + factor(NEIGHBORHOOD) + factor(BUILDING.CLASS.CATEGORY), data=bronx)
> summary(m3)
> plot(resid(m3))
#
> m4 <- lm(log(SALE.PRICE) ~ 0 + log(GROSS.SQUARE.FEET) + log(LAND.SQUARE.FEET) + factor(NEIGHBORHOOD) * factor(BUILDING.CLASS.CATEGORY), data=bronx)
> summary(m4)
> plot(resid(m4))

And this one?

Did you get to create the sales map?
table(mapcoord$NEIGHBORHOOD) # contingency table
mapcoord$NEIGHBORHOOD <- as.factor(mapcoord$NEIGHBORHOOD) # and this?
geoPlot(mapcoord, zoom=12, color=mapcoord$NEIGHBORHOOD) # this one is easier

Did you forget the KNN?
# almost there.
mapcoord$class <- as.numeric(mapcoord$NEIGHBORHOOD)
nclass <- dim(mapcoord)[1]
split <- 0.8
trainid <- sample.int(nclass, floor(split*nclass))
testid <- (1:nclass)[-trainid]
## mappred <- mapcoord[testid,]
## mappred$class <- as.numeric(mappred$NEIGHBORHOOD)

KNN! Did you loop over k?
knnpred <- knn(mapcoord[trainid,3:4], mapcoord[testid,3:4], cl=mapcoord[trainid,2], k=5)
knntesterr <- sum(knnpred != mapcoord[testid,2]) / length(testid)
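
Looping over k can be done by wrapping the two lines above in a function of k and collecting the test error for each value. A sketch, assuming iris in place of mapcoord and an arbitrary grid of odd k values:

```r
library(class)   # provides knn()

# Assumption: iris replaces mapcoord; the seed, split and k grid are illustrative.
set.seed(1)
n       <- nrow(iris)
split   <- 0.8
trainid <- sample.int(n, floor(split * n))
testid  <- (1:n)[-trainid]

ks <- seq(1, 15, by = 2)
knntesterr <- sapply(ks, function(k) {
  pred <- knn(iris[trainid, 1:4], iris[testid, 1:4],
              cl = iris$Species[trainid], k = k)
  mean(pred != iris$Species[testid])   # test error rate for this k
})

best_k <- ks[which.min(knntesterr)]
```

A single random split gives a noisy error curve, so the minimum over k should be read with caution; repeating the split or cross-validating gives a steadier picture.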

K-Means!
> mapmeans <- data.frame(adduse$ZIP.CODE, as.numeric(mapcoord$NEIGHBORHOOD), adduse$TOTAL.UNITS, adduse$"LAND.SQUARE.FEET", adduse$GROSS.SQUARE.FEET, adduse$SALE.PRICE, adduse$'querylist$latitude', adduse$'querylist$longitude')
# algorithm and method are given as the full default vectors, which selects the first entry of each
> mapobj <- kmeans(mapmeans, 5, iter.max=10, nstart=5, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"))
> fitted(mapobj, method=c("centers","classes"))

Return object
cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
centers: A matrix of cluster centres.
totss: The total sum of squares.
withinss: Vector of within-cluster sum of squares, one component per cluster.
tot.withinss: Total within-cluster sum of squares, i.e., sum(withinss).
betweenss: The between-cluster sum of squares, i.e. totss-tot.withinss.
size: The number of points in each cluster.
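
The relationships among these components can be checked directly. A sketch, assuming the four numeric iris columns in place of mapmeans; the columns are scaled first because k-means works on Euclidean distance, and variables in very different units (ZIP code, price, latitude) would otherwise be dominated by the largest one:

```r
# Assumption: scaled iris measurements replace mapmeans; k = 3 is illustrative.
set.seed(1)
x  <- scale(iris[, 1:4])
km <- kmeans(x, centers = 3, nstart = 5)

sizes   <- km$size       # points per cluster
centers <- km$centers    # one row per cluster, one column per variable

# totss splits into within-cluster and between-cluster parts.
identity_gap <- abs(km$totss - (km$tot.withinss + km$betweenss))
explained    <- km$betweenss / km$totss   # share of variation between clusters
```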

Huh? What is this?
plot(mapmeans, mapobj$cluster)

Plotting clusters (preview)
library(cluster)
clusplot(mapmeans, mapobj$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
# Centroid Plot against 1st 2 discriminant functions
library(fpc)
plotcluster(mapmeans, mapobj$cluster)

Comparing cluster fits (e.g. different k)
library(fpc)
cluster.stats(d, fit1$cluster, fit2$cluster)
Use help.
> help(plotcluster)
> help(cluster.stats)
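
Before reaching for fpc, a rough comparison across k is possible with base R alone by tracking the total within-cluster sum of squares and cross-tabulating two fits. A sketch under the same assumption as before (scaled iris data, illustrative range of k); this is a simpler stand-in, not what cluster.stats computes:

```r
# Assumption: scaled iris measurements; the range of k is illustrative.
set.seed(1)
x  <- scale(iris[, 1:4])
ks <- 2:8

fits <- lapply(ks, function(k) kmeans(x, centers = k, nstart = 10))
wss  <- sapply(fits, function(f) f$tot.withinss)   # lower means tighter clusters
totss <- fits[[1]]$totss                           # same for every k

# Cross-tabulate two fits to see how clusters split when k goes from 2 to 3.
overlap <- table(k2 = fits[[1]]$cluster, k3 = fits[[2]]$cluster)

if (interactive()) {
  plot(ks, wss, type = "b", xlab = "k", ylab = "total within-cluster SS")
}
```

The within-cluster sum of squares tends to fall as k grows, so the usual reading is to look for the k after which the curve flattens rather than for its minimum.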

Assignment 3?
Preliminary and Statistical Analysis. Due next Friday. 15% (written)
– Distribution analysis and comparison, visual ‘analysis’, statistical model fitting and testing of some of the nyt1…31 datasets.
How is it going?

Assignments to come
Assignment 4: Pattern, trend, relations: model development and evaluation. Due ~ early March. 15% (10% written and 5% oral; individual);
Assignment 5: Term project proposal. Due ~ week 7. 5% (0% written and 5% oral; individual);
Assignment 6: Predictive and Prescriptive Analytics. Due ~ week 9. 15% (15% written; individual);
Term project. Due ~ week % (25% written, 5% oral; individual).

Admin info (keep/print this slide)
Class: ITWS-4963/ITWS 6965
Hours: 12:00pm-1:50pm Tuesday/Friday
Location: SAGE 3101
Instructor: Peter Fox
Instructor contact: (do not leave a
Contact hours: Monday** 3:00-4:00pm (or by appt)
Contact location: Winslow 2120 (sometimes Lally 207A announced by )
TA: Lakshmi Chenicheri
Web site:
– Schedule, lectures, syllabus, reading, assignments, etc.