

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 5b, February 21, 2014 Interpreting regression, kNN and K-means results, evaluating models.


2 Plot tools/tips
http://statmethods.net/advgraphs/layout.html
plot, points

3 Linear and least-squares
> EPI_data <- read.csv("EPI_data.csv")
> attach(EPI_data)
> boxplot(ENVHEALTH,DALY,AIR_H,WATER_H)
> lmENVH <- lm(ENVHEALTH~DALY+AIR_H+WATER_H)
> lmENVH
Let’s recall what this taught you!
> summary(lmENVH)
> cENVH <- coef(lmENVH)

4 Linear and least-squares
> lmENVH <- lm(ENVHEALTH~DALY+AIR_H+WATER_H)
> lmENVH

Call:
lm(formula = ENVHEALTH ~ DALY + AIR_H + WATER_H)

Coefficients:
(Intercept)         DALY        AIR_H      WATER_H
 -2.673e-05    5.000e-01    2.500e-01    2.500e-01

> summary(lmENVH)
…
> cENVH <- coef(lmENVH)

5 Linear and least-squares
> summary(lmENVH)

Call:
lm(formula = ENVHEALTH ~ DALY + AIR_H + WATER_H)

Residuals:
       Min         1Q     Median         3Q        Max
-0.0072734 -0.0027299  0.0001145  0.0021423  0.0055205

Coefficients:
              Estimate Std. Error   t value Pr(>|t|)
(Intercept) -2.673e-05  6.377e-04    -0.042    0.967
DALY         5.000e-01  1.922e-05 26020.669   <2e-16 ***
AIR_H        2.500e-01  1.273e-05 19645.297   <2e-16 ***
WATER_H      2.500e-01  1.751e-05 14279.903   <2e-16 ***
---

p < 0.01: very strong presumption against null hypothesis vs. this fit
0.01 < p < 0.05: strong presumption against null hypothesis
0.05 < p < 0.1: low presumption against null hypothesis
p > 0.1: no presumption against the null hypothesis

6 Linear and least-squares
Continued:
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.003097 on 178 degrees of freedom
  (49 observations deleted due to missingness)
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 3.983e+09 on 3 and 178 DF, p-value: < 2.2e-16

> names(lmENVH)
 [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"
 [7] "qr"            "df.residual"   "na.action"     "xlevels"       "call"          "terms"
[13] "model"

7 Object of class lm
An object of class "lm" is a list containing at least the following components:
coefficients: a named vector of coefficients.
residuals: the residuals, that is response minus fitted values.
fitted.values: the fitted mean values.
rank: the numeric rank of the fitted linear model.
weights: (only for weighted fits) the specified weights.
df.residual: the residual degrees of freedom.
call: the matched call.
terms: the terms object used.
contrasts: (only where relevant) the contrasts used.
xlevels: (only where relevant) a record of the levels of the factors used in fitting.
offset: the offset used (missing if none were used).
y: if requested, the response used.
x: if requested, the model matrix used.
model: if requested (the default), the model frame used.
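
The components listed above can be read directly from a fitted model. The following is a minimal sketch on simulated data; the data frame `d` and its columns `x1`, `x2`, `y` are made up for illustration and are not the EPI data. It checks that the stored residuals really are response minus fitted values and pulls out the coefficient table that `summary()` prints.

```r
# Simulated stand-in data (hypothetical, for illustration only)
set.seed(42)
d <- data.frame(x1 = runif(100, 0, 100), x2 = runif(100, 0, 100))
d$y <- 0.5 * d$x1 + 0.25 * d$x2 + rnorm(100, sd = 0.5)

fit <- lm(y ~ x1 + x2, data = d)

# Components can be reached with $ or with the extractor functions
b   <- coef(fit)        # same values as fit$coefficients
res <- residuals(fit)   # same values as fit$residuals
fv  <- fitted(fit)      # same values as fit$fitted.values

# Residuals are response minus fitted values (difference is numerically zero)
max_diff <- max(abs(res - (d$y - fv)))

# Coefficient table (estimate, std. error, t value, p-value) as a matrix
ctab <- coef(summary(fit))
print(ctab)
```

The extractor functions are usually preferable to `$` because they also respect the model's `na.action`.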

8 Plot original versus fitted
> plot(ENVHEALTH,col="red")
> points(lmENVH$fitted.values,col="blue")
Huh?

9 Try again!
> plot(ENVHEALTH[!is.na(ENVHEALTH)], col="red")
> points(lmENVH$fitted.values,col="blue")
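
The first plot misaligns because `lm()` drops incomplete rows before fitting (the summary reports 49 observations deleted), so `fitted.values` is shorter than `ENVHEALTH`. Rows are dropped when any model variable is missing, not only the response, so filtering on the response alone may still leave the two series out of step. One possible alternative, sketched here on simulated data (the data frame `d` is hypothetical), is `na.action = na.exclude`, which makes `fitted()` pad the dropped rows with `NA`.

```r
# Simulated data with missing values in both response and predictor
set.seed(7)
d <- data.frame(x = runif(30, 0, 100))
d$y <- 2 + 0.5 * d$x + rnorm(30)
d$y[c(3, 10)] <- NA   # missing response
d$x[c(5, 20)] <- NA   # missing predictor

fit_omit <- lm(y ~ x, data = d, na.action = na.omit)     # the usual default
fit_excl <- lm(y ~ x, data = d, na.action = na.exclude)  # pads with NA

n_omit <- length(fitted(fit_omit))   # complete rows only
n_excl <- length(fitted(fit_excl))   # same length as nrow(d)
padded <- which(is.na(fitted(fit_excl)))
print(c(n_omit = n_omit, n_excl = n_excl))
```

With the padded version, `plot(d$y, col="red")` followed by `points(fitted(fit_excl), col="blue")` overlays each fitted value on its own observation.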

10 Predict
> cENVH <- coef(lmENVH)
> DALYNEW <- c(seq(5,95,5)) #2
> AIR_HNEW <- c(seq(5,95,5)) #3
> WATER_HNEW <- c(seq(5,95,5)) #4

11 Predict
> NEW <- data.frame(DALYNEW,AIR_HNEW,WATER_HNEW)
> pENV <- predict(lmENVH,NEW,interval="prediction")
> cENV <- predict(lmENVH,NEW,interval="confidence") # look up what this does

12 Predict object returns
predict.lm produces a vector of predictions or a matrix of predictions and bounds with column names fit, lwr, and upr if interval is set. Access via [,1] etc.
If se.fit is TRUE, a list with the following components is returned:
fit: vector or matrix as above
se.fit: standard error of predicted means
residual.scale: residual standard deviations
df: degrees of freedom for residual
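
A minimal sketch of both interval types and of the `se.fit` return value, on simulated data that only borrows the EPI variable names (the values are made up). One point worth checking in your own code: `predict()` matches the columns of the new data frame to the predictors by name, so the new columns here are called `DALY`, `AIR_H` and `WATER_H`. If the names differ, R looks for the original variables elsewhere (for example in attached data) and may return one row per original observation instead.

```r
# Simulated stand-in for the EPI variables (hypothetical values)
set.seed(1)
d <- data.frame(DALY    = runif(50, 0, 100),
                AIR_H   = runif(50, 0, 100),
                WATER_H = runif(50, 0, 100))
d$ENVHEALTH <- 0.5 * d$DALY + 0.25 * d$AIR_H + 0.25 * d$WATER_H +
  rnorm(50, sd = 0.5)

fit <- lm(ENVHEALTH ~ DALY + AIR_H + WATER_H, data = d)

# New data: column names must match the predictors in the formula
NEW <- data.frame(DALY = seq(5, 95, 5), AIR_H = seq(5, 95, 5),
                  WATER_H = seq(5, 95, 5))

pENV <- predict(fit, NEW, interval = "prediction")  # bounds for a new observation
cENV <- predict(fit, NEW, interval = "confidence")  # bounds for the mean response
sENV <- predict(fit, NEW, se.fit = TRUE)            # list: fit, se.fit, df, residual.scale

# Prediction intervals are wider than confidence intervals
width_pred <- pENV[, "upr"] - pENV[, "lwr"]
width_conf <- cENV[, "upr"] - cENV[, "lwr"]
print(head(cbind(pENV, width_pred, width_conf)))
```

The prediction interval adds the residual variance to the uncertainty of the fitted mean, which is why it is always the wider of the two.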

13 Output from predict
> head(pENV)
       fit      lwr      upr
1       NA       NA       NA
2 11.55213 11.54591 11.55834
3 18.29168 18.28546 18.29791
4       NA       NA       NA
5 69.92533 69.91915 69.93151
6 90.20589 90.19974 90.21204
…

14 > tail(pENV)
         fit      lwr      upr
226       NA       NA       NA
227       NA       NA       NA
228 34.95256 34.94641 34.95871
229 59.00213 58.99593 59.00834
230 24.20951 24.20334 24.21569
231 38.03701 38.03084 38.04319

15 Did you repeat this for AIR_E and CLIMATE?

16 K Nearest Neighbors (classification)
Scripts – Lab4b_0_2014.R
> nyt1 <- read.csv("nyt1.csv")
… from week 4b slides or script
> classif <- knn(train,test,cg,k=5) #
> head(true.labels)
[1] 1 0 0 1 1 0
> head(classif)
[1] 1 1 1 1 0 0
Levels: 0 1
> ncorrect <- true.labels==classif
> table(ncorrect)["TRUE"] # or
> length(which(ncorrect))
What do you conclude?
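
To help answer that question, the count of correct labels can be turned into an accuracy and a confusion table. This is a sketch only: it uses the built-in `iris` data as a stand-in for `nyt1` (an assumption, since the lab script is elided above), with `train`, `test`, `cg` and `true.labels` built in a way that mirrors the names on the slide.

```r
library(class)   # provides knn(); a recommended package shipped with R

# Stand-in data: iris replaces nyt1 for illustration
set.seed(123)
n <- nrow(iris)
train_id <- sample.int(n, floor(0.8 * n))
train <- iris[train_id, 1:4]
test  <- iris[-train_id, 1:4]
cg          <- iris$Species[train_id]    # training class labels
true.labels <- iris$Species[-train_id]   # held-out labels

classif <- knn(train, test, cg, k = 5)

ncorrect <- true.labels == classif
accuracy <- mean(ncorrect)               # fraction classified correctly

# Rows: predicted class, columns: actual class
confusion <- table(predicted = classif, actual = true.labels)
print(confusion)
print(accuracy)
```

The off-diagonal cells of the confusion table show which classes are being mistaken for each other, which a single accuracy number hides.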

17 Contingency tables
> table(nyt1$Impressions,nyt1$Gender) #
        0    1
  1    69   85
  2   389  395
  3   975  937
  4  1496 1572
  5  1897 2012
  6  1822 1927
  7  1525 1696
  8  1142 1203
  9   722  711
  10  366  400
  11  214  200
  12   86  101
  13   41   43
  14   10    9
  15    5    7
  16    0    4
  17    0    1
Contingency table: displays the (multivariate) frequency distribution of the variables. Tests for significance (not now)
> table(nyt1$Clicks,nyt1$Gender)
        0     1
  1 10335 10846
  2   415   440
  3     9    17

18 Regression
> plot(log(bronx$GROSS.SQUARE.FEET), log(bronx$SALE.PRICE))
> m1 <- lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET),data=bronx)
You were reminded that log(0) is … not fun. THINK through what you are doing…
Filtering is somewhat inevitable:
> bronx <- bronx[which(bronx$GROSS.SQUARE.FEET>0 & bronx$LAND.SQUARE.FEET>0 & bronx$SALE.PRICE>0),]
> m1 <- lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET),data=bronx)
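
A minimal sketch of why the filter matters, using simulated sales (the data frame `sales` and its values are made up; only the column names follow the Bronx data). `log(0)` is `-Inf`, and `lm()` cannot fit a model with infinite values, so rows with zero area or zero price have to go before the log-log fit. The formula here uses bare column names with `data=`, a small departure from the slide's `bronx$...` style, so that the `data` argument is what actually supplies the variables.

```r
# Simulated sales with a known log-log slope of 0.7 (hypothetical data)
set.seed(2014)
n <- 200
sales <- data.frame(GROSS.SQUARE.FEET = round(runif(n, 500, 5000)))
sales$SALE.PRICE <- round(exp(7 + 0.7 * log(sales$GROSS.SQUARE.FEET) +
                                rnorm(n, sd = 0.3)))

# Inject the kind of zeros found in real sales records
sales$GROSS.SQUARE.FEET[1:5] <- 0
sales$SALE.PRICE[6:10]       <- 0

log_zero <- log(0)   # -Inf

# Keep only rows where both logged variables are strictly positive
keep  <- sales$GROSS.SQUARE.FEET > 0 & sales$SALE.PRICE > 0
clean <- sales[keep, ]

m1 <- lm(log(SALE.PRICE) ~ log(GROSS.SQUARE.FEET), data = clean)
slope <- unname(coef(m1)[2])
print(summary(m1)$coefficients)
```

In a log-log model the slope reads as an elasticity: a 1% increase in gross square feet goes with roughly a `slope`% increase in sale price.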

19 Interpreting this!

Call:
lm(formula = log(SALE.PRICE) ~ log(GROSS.SQUARE.FEET), data = bronx)

Residuals:
     Min       1Q   Median       3Q      Max
-14.4529   0.0377   0.4160   0.6572   3.8159

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)              7.0271     0.3088   22.75   <2e-16 ***
log(GROSS.SQUARE.FEET)   0.7013     0.0379   18.50   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.95 on 2435 degrees of freedom
Multiple R-squared: 0.1233, Adjusted R-squared: 0.1229
F-statistic: 342.4 on 1 and 2435 DF, p-value: < 2.2e-16

20 Plots – tell me what they tell you!

21 Solution model 2
> m2 <- lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD),data=bronx)
> summary(m2)
> plot(resid(m2))
#
> m2a <- lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD),data=bronx)
> summary(m2a)
> plot(resid(m2a))

22 How do you interpret this residual plot?

23 Solution model 3 and 4
> m3 <- lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD)+factor(bronx$BUILDING.CLASS.CATEGORY),data=bronx)
> summary(m3)
> plot(resid(m3))
#
> m4 <- lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD)*factor(bronx$BUILDING.CLASS.CATEGORY),data=bronx)
> summary(m4)
> plot(resid(m4))

24 And this one?

25 Did you get to create the sales map?
table(mapcoord$NEIGHBORHOOD) # contingency table
mapcoord$NEIGHBORHOOD <- as.factor(mapcoord$NEIGHBORHOOD) # and this?
geoPlot(mapcoord,zoom=12,color=mapcoord$NEIGHBORHOOD) # this one is easier


27 Did you forget the KNN?
#almost there.
mapcoord$class <- as.numeric(mapcoord$NEIGHBORHOOD)
nclass <- dim(mapcoord)[1]
split <- 0.8
trainid <- sample.int(nclass,floor(split*nclass))
testid <- (1:nclass)[-trainid]
##mappred <- mapcoord[testid,]
##mappred$class <- as.numeric(mappred$NEIGHBORHOOD)

28 KNN! Did you loop over k?
knnpred <- knn(mapcoord[trainid,3:4],mapcoord[testid,3:4],cl=mapcoord[trainid,2],k=5)
knntesterr <- sum(knnpred!=mapcoord[testid,2])/length(testid)
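
One way to loop over k, sketched with the built-in `iris` data standing in for `mapcoord` (an assumption made so the example runs on its own). The split follows the 80/20 pattern from the previous slide, and the test error is computed the same way as `knntesterr` above, once per candidate k. The candidate values in `ks` are arbitrary.

```r
library(class)   # provides knn()

# Stand-in data: iris replaces mapcoord for illustration
set.seed(99)
n <- nrow(iris)
split <- 0.8
trainid <- sample.int(n, floor(split * n))
testid  <- (1:n)[-trainid]

ks <- c(1, 3, 5, 7, 9, 11)   # arbitrary candidate values of k

# Test error (fraction misclassified) for each k
knntesterr <- sapply(ks, function(k) {
  knnpred <- knn(iris[trainid, 1:4], iris[testid, 1:4],
                 cl = iris$Species[trainid], k = k)
  sum(knnpred != iris$Species[testid]) / length(testid)
})
names(knntesterr) <- ks

best_k <- ks[which.min(knntesterr)]   # first k with the lowest test error
print(knntesterr)
print(best_k)
```

With a test set this small the error curve is noisy, so repeating the split several times (or cross-validating) gives a steadier choice of k.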

29 K-Means!
> mapmeans <- data.frame(adduse$ZIP.CODE, as.numeric(mapcoord$NEIGHBORHOOD), adduse$TOTAL.UNITS, adduse$"LAND.SQUARE.FEET", adduse$GROSS.SQUARE.FEET, adduse$SALE.PRICE, adduse$'querylist$latitude', adduse$'querylist$longitude')
> mapobj <- kmeans(mapmeans,5, iter.max=10, nstart=5, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"))
> fitted(mapobj,method=c("centers","classes"))

30 Return object
cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
centers: A matrix of cluster centres.
totss: The total sum of squares.
withinss: Vector of within-cluster sum of squares, one component per cluster.
tot.withinss: Total within-cluster sum of squares, i.e., sum(withinss).
betweenss: The between-cluster sum of squares, i.e. totss-tot.withinss.
size: The number of points in each cluster.
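
The relationships between these components can be checked numerically. A minimal sketch on the built-in `iris` measurements (a stand-in for `mapmeans`). The `scale()` step is an addition that is not on the slides: k-means works on Euclidean distances, so without it a column with a large range, such as a sale price, would dominate the clustering.

```r
# Stand-in data: the four numeric iris columns, standardized
set.seed(5)
x <- scale(iris[, 1:4])

km <- kmeans(x, centers = 3, iter.max = 10, nstart = 5)

print(km$size)      # number of points in each cluster
print(km$centers)   # one row of centre coordinates per cluster

# totss splits into within-cluster and between-cluster parts
ss_gap  <- abs(km$totss - (km$tot.withinss + km$betweenss))
sum_gap <- abs(sum(km$withinss) - km$tot.withinss)

# Share of total variation explained by the clustering
explained <- km$betweenss / km$totss
print(explained)
```

The ratio `betweenss / totss` is a quick summary of how well separated the clusters are, though it always increases as k grows.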

31 Huh? What is this?
plot(mapmeans, mapobj$cluster)

32 Plotting clusters (preview)
library(cluster)
clusplot(mapmeans, mapobj$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
# Centroid Plot against 1st 2 discriminant functions
library(fpc)
plotcluster(mapmeans, mapobj$cluster)

33 Comparing cluster fits (e.g. different k)
library(fpc)
cluster.stats(d, fit1$cluster, fit2$cluster)
Use help.
> help(plotcluster)
> help(cluster.stats)
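
As a simpler complement to `cluster.stats`, fits for different k can be compared with base R alone by tabulating the total within-cluster sum of squares. This is a different technique from the `fpc` call above (often called the elbow heuristic), sketched on the standardized `iris` measurements as stand-in data; the range of k is arbitrary.

```r
# Stand-in data: standardized iris measurements
set.seed(11)
x <- scale(iris[, 1:4])

ks <- 2:6   # arbitrary range of cluster counts to compare

# Total within-cluster sum of squares for each k (lower means tighter clusters)
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)
names(wss) <- ks
print(round(wss, 1))

# Drop in within-cluster sum of squares gained by each extra cluster
gain <- -diff(wss)
print(round(gain, 1))
```

The within-cluster sum of squares tends to keep falling as k grows, so the point of interest is where the gain from one more cluster flattens out rather than where the value is smallest.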

34 Assignment 3?
Preliminary and Statistical Analysis. Due next Friday. 15% (written)
– Distribution analysis and comparison, visual ‘analysis’, statistical model fitting and testing of some of the nyt1…31 datasets.
How is it going?

35 Assignments to come
Assignment 4: Pattern, trend, relations: model development and evaluation. Due ~ early March. 15% (10% written and 5% oral; individual);
Assignment 5: Term project proposal. Due ~ week 7. 5% (0% written and 5% oral; individual);
Assignment 6: Predictive and Prescriptive Analytics. Due ~ week 9. 15% (15% written; individual);
Term project. Due ~ week 13. 30% (25% written, 5% oral; individual).

36 Admin info (keep/print this slide)
Class: ITWS-4963/ITWS 6965
Hours: 12:00pm-1:50pm Tuesday/Friday
Location: SAGE 3101
Instructor: Peter Fox
Instructor contact: pfox@cs.rpi.edu, 518.276.4862 (do not leave a msg)
Contact hours: Monday** 3:00-4:00pm (or by email appt)
Contact location: Winslow 2120 (sometimes Lally 207A announced by email)
TA: Lakshmi Chenicheri chenil@rpi.edu
Web site: http://tw.rpi.edu/web/courses/DataAnalytics/2014
– Schedule, lectures, syllabus, reading, assignments, etc.

