Data Analytics – ITWS-4963/ITWS-6965 Lab exercises: working with a real dataset, more regression, kNN and K-means, and plotting… Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 5b, February 27, 2015
Plot tools/ tips http://statmethods.net/advgraphs/layout.html http://flowingdata.com/2014/02/27/how-to-read-histograms-and-use-them-in-r/ pairs, gpairs, scatterplot.matrix, clustergram, etc. data() # precip, presidents, iris, swiss, sunspot.month (!), environmental, ethanol, ionosphere More script fragments in R will be available on the web site (http://escience.rpi.edu/data/DA )
If you did the regression… Lab last Friday, skip to slide 12 Otherwise review slides 5-11 and make sure you understand what you are getting A lot of manipulating to do…
K Nearest Neighbors (classification) Scripts – Lab4b_0_2014.R > nyt1<-read.csv(“nyt1.csv") … from week 3b slides or script > classif<-knn(train,test,cg,k=5) # > head(true.labels) [1] 1 0 0 1 1 0 > head(classif) [1] 1 1 1 1 0 0 Levels: 0 1 > ncorrect<-true.labels==classif > table(ncorrect)["TRUE"] # or > length(which(ncorrect)) > What do you conclude?
Regression You were reminded that log(0) is … not fun > plot(log(bronx$GROSS.SQUARE.FEET), log(bronx$SALE.PRICE) ) > m1<-lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET),data=bronx) You were reminded that log(0) is … not fun THINK through what you are doing… Filtering is somewhat inevitable: > bronx<-bronx[which(bronx$GROSS.SQUARE.FEET>0 & bronx$LAND.SQUARE.FEET>0 & bronx$SALE.PRICE>0),]
Interpreting this! Call: lm(formula = log(SALE.PRICE) ~ log(GROSS.SQUARE.FEET), data = bronx) Residuals: Min 1Q Median 3Q Max -14.4529 0.0377 0.4160 0.6572 3.8159 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 7.0271 0.3088 22.75 <2e-16 *** log(GROSS.SQUARE.FEET) 0.7013 0.0379 18.50 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.95 on 2435 degrees of freedom Multiple R-squared: 0.1233, Adjusted R-squared: 0.1229 F-statistic: 342.4 on 1 and 2435 DF, p-value: < 2.2e-16
Plots – tell me what they tell you!
Solution model 2 > m2<-lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD),data=bronx) > summary(m2) > plot(resid(m2)) # > m2a<-lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD),data=bronx) > summary(m2a) > plot(resid(m2a))
How do you interpret this residual plot?
Solution model 3 and 4 > m3<-lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD)+factor(bronx$BUILDING.CLASS.CATEGORY),data=bronx) > summary(m3) > plot(resid(m3)) # > m4<-lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD)*factor(bronx$BUILDING.CLASS.CATEGORY),data=bronx) > summary(m4) > plot(resid(m4))
And this one?
A complex example > install.packages("geoPlot") > install.packages(”xslx") > require(class) > require(gdata) > require(geoPlot) > require(”xslx”) #if not already read-in: bronx<-read.xlsx(”sales/rollingsales_bronx.xls",pattern="BOROUGH",stringsAsFactors=FALSE,sheetIndex=1,startRow=5,header=TRUE)
View(bronx) #clean up with regular expressions bronx$SALE.PRICE<- as.numeric(gsub("[^]:digit:]]","",bronx$SALE.PRICE)) #missing values? sum(is.na(bronx$SALE.PRICE)) #zero sale prices sum(bronx$SALE.PRICE==0) #clean these numeric and date fields bronx$GROSS.SQUARE.FEET<- as.numeric(gsub("[^]:digit:]]","",bronx$GROSS.SQUARE.FEET)) bronx$LAND.SQUARE.FEET<- as.numeric(gsub("[^]:digit:]]","",bronx$LAND.SQUARE.FEET)) bronx$SALE.DATE<- as.Date(gsub("[^]:digit:]]","",bronx$SALE.DATE)) bronx$YEAR.BUILT<- as.numeric(gsub("[^]:digit:]]","",bronx$YEAR.BUILT)) bronx$ZIP.CODE<- as.character(gsub("[^]:digit:]]","",bronx$ZIP.CODE))
More corrections #filter out low prices minprice<-10000 bronx<-bronx[which(bronx$SALE.PRICE>=minprice),] #how many left? nval<-dim(bronx)[1] #addresses contain apartment #'s even though there is another column for that - remove them - compresses addresses bronx$ADDRESSONLY<- gsub("[,][[:print:]]*","",gsub("[ ]+","",trim(bronx$ADDRESS))) #new data frame for sorting the addresses, fixing etc. bronxadd<-unique(data.frame(bronx$ADDRESSONLY, bronx$ZIP.CODE,stringsAsFactors=FALSE)) # fix the names names(bronxadd)<-c("ADDRESSONLY","ZIP.CODE") bronxadd<-bronxadd[order(bronxadd$ADDRESSONLY),]
Yep, more… # duplicates? duplicates<-duplicated(bronxadd$ADDRESSONLY) ##if(duplicates) dupadd<-bronxadd[bronxadd$duplicates,1] ##bronxadd<-bronxadd[(bronxadd$ADDRESSONLY!=dupadd[1] & bronxadd$ADDRESSONLY != dupadd[2]),] #how many? nadd<-dim(bronxadd)[1]
Oh, we want nearest neighbors? How? #problem, we need a spatial distribution since none of the columns have that #we will use google maps so limit the number to under 500 (ask me why) nsample=450 addsample<-bronxadd[sample.int(nadd,size=nsample),] #new data frame for the full address addrlist<-data.frame(1:nsample,addsample$ADDRESSONLY,rep("NEW YORK",times=nsample), rep("NY",times=nsample), addsample$ZIP.CODE, rep("US",times=nsample)) #look them up querylist<-addrListLookup(addrlist)
Lots missing – why? # how many returned valid lat/long? querylist$matched <- (querylist$latitude !=0) unmatchedid<- which(!querylist$matched) #MANY missing - what's up? unmatched<- length(unmatchedid) #WEST -> W and EAST -> E - do again. addrlist2<-data.frame(1:unmatched,gsub(" WEST "," W ",gsub(" EAST "," E ",addsample[unmatchedid,1])),rep("NEW YORK",times=unmatched),rep("NY",times=unmatched),addsample[unmatchedid,2],rep("US",times=unmatched)) querylist[unmatchedid,1:4]<-addrListLookup(addrlist2)[,1:4]
Not enough #this fixed a LOT but we need more: STREET and AVENUE (could have done PLACE) and others addrlist3<-data.frame(1:unmatched,gsub("WEST","W",gsub("EAST","E",gsub("STREET","ST ", gsub("AVENUE","AVE", addsample[unmatchedid,1])))),rep("NEW YORK", times=unmatched), rep("NY",times=unmatched), addsample[unmatchedid,2], rep("US",times=unmatched)) querylist[unmatchedid,1:4]<-addrListLookup(addrlist3)[,1:4] querylist$matched <- (querylist$latitude !=0) unmatchedid<- which(!querylist$matched) unmatched<- length(unmatchedid) # 9 left now? good enough.
Rebuild! addsample<-cbind(addsample,querylist$latitude,querylist$longitude) ##names(addsample[3:4])<-c("latitude","longitude") - this was meant to correct the column names but did not work for me addsample<-addsample[addsample$'querylist$latitude'!=0,] # note ' ' to work around column name adduse<-merge(bronx,addsample) adduse<-adduse[!is.na(adduse$latitude),]
Most satisfying part! mapcoord<-adduse[,c(2,4,24,25)] table(mapcoord$NEIGHBORHOOD) mapcoord$NEIGHBORHOOD <- as.factor(mapcoord$NEIGHBORHOOD) # geoPlot(mapcoord,zoom=12,color=mapcoord$NEIGHBORHOOD)
Did you forget the KNN? #almost there. mapcoord$class<as.numeric(mapcoord$NEIGHBORHOOD) nclass<-dim(mapcoord)[1] split<-0.8 trainid<-sample.int(nclass,floor(split*nclass)) testid<-(1:nclass)[-trainid] ##mappred<-mapcoord[testid,] ##mappred$class<as.numeric(mappred$NEIGHBORHOOD)
KNN! kmax<-10 knnpred<-matrix(NA,ncol=kmax,nrow=length(testid)) knntesterr<-rep(NA,times=kmax) for (i in 1:kmax){ # loop over k knnpred[,i]<-knn(mapcoord[trainid,3:4],mapcoord[testid,3:4],cl=mapcoord[trainid,2],k=i) knntesterr[i]<-sum(knnpred[,i]!=mapcoord[testid,2])/length(testid) }
KNN! Did you loop over k? { knnpred<-knn(mapcoord[trainid,3:4],mapcoord[testid,3:4],cl=mapcoord[trainid,2],k=5) knntesterr<-sum(knnpred!=mappred$class)/length(testid) } knntesterr [1] 0.1028037 0.1308411 0.1308411 0.1588785 0.1401869 0.1495327 0.1682243 0.1962617 0.1962617 0.1869159 What do you think?
Finally K-Means! > mapmeans<-data.frame(adduse$ZIP.CODE, as.numeric(mapcoord$NEIGHBORHOOD), adduse$TOTAL.UNITS, adduse$"LAND.SQUARE.FEET", adduse$GROSS.SQUARE.FEET, adduse$SALE.PRICE, adduse$'querylist$latitude', adduse$'querylist$longitude') > mapobj<-kmeans(mapmeans,5, iter.max=10, nstart=5, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")) > fitted(mapobj,method=c("centers","classes")) > plot(mapmeans,mapobj$cluster)
Did you get to create the sales map? table(mapcoord$NEIGHBORHOOD) # contingency table mapcoord$NEIGHBORHOOD <- as.factor(mapcoord$NEIGHBORHOOD) # and this? geoPlot(mapcoord,zoom=12,color=mapcoord$NEIGHBORHOOD) # this one is easier
Return object cluster A vector of integers (from 1:k) indicating the cluster to which each point is allocated. centers A matrix of cluster centres. totss The total sum of squares. withinss Vector of within-cluster sum of squares, one component per cluster. tot.withinss Total within-cluster sum of squares, i.e., sum(withinss). betweenss The between-cluster sum of squares, i.e. totss-tot.withinss. size The number of points in each cluster.
Plotting clusters (preview) library(cluster) clusplot(mapmeans, mapobj$cluster, color=TRUE, shade=TRUE, labels=2, lines=0) # Centroid Plot against 1st 2 discriminant functions library(fpc) plotcluster(mapmeans, mapobj$cluster)
Comparing cluster fits (e.g. different k) library(fpc) cluster.stats(d, fit1$cluster, fit2$cluster) Use help. > help(plotcluster) > help(cluster.stats)