Peter Fox
Data Analytics – ITWS-4600/ITWS-6600
Week 6b, March 4, 2016
Interpretation: Regression, Clustering (plotting), Clustergrams, Trees and Hierarchies…
Assignment 6 preview
Your term projects should fall within the scope of a data analytics problem of the type you have worked with in class/labs, or know of yourself – the bigger the data the better. This means that the work must go beyond just making lots of figures. You should develop the project to show that you are thinking about and exploring the relationships and distributions within your data. Start with a hypothesis, think of a way to model and use the hypothesis, find or collect the necessary data, and do preliminary analysis, detailed modeling, and a summary (interpretation).
– Note: you do not have to come up with a positive result; disproving the hypothesis is just as good.
Please use the section numbering below for your written submission for this assignment:
Introduction (2%)
Data Description (3%)
Analysis (8%)
Model Development (8%)
Conclusions and Discussion (4%)
Oral presentation (5%) (10 mins)
Contents
Regression
> plot(log(bronx$GROSS.SQUARE.FEET), log(bronx$SALE.PRICE))
> m1 <- lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET), data=bronx)
You were reminded that log(0) is … not fun. THINK through what you are doing… Filtering is somewhat inevitable:
> bronx <- bronx[which(bronx$GROSS.SQUARE.FEET > 0 & bronx$LAND.SQUARE.FEET > 0 & bronx$SALE.PRICE > 0), ]
> m1 <- lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET), data=bronx)
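A quick sanity check before taking logs – how many rows would log() break? A minimal sketch in base R, assuming the numeric bronx columns above:
sum(bronx$GROSS.SQUARE.FEET <= 0, na.rm = TRUE)
sum(bronx$LAND.SQUARE.FEET <= 0, na.rm = TRUE)
sum(bronx$SALE.PRICE <= 0, na.rm = TRUE)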
Interpreting this!
Call:
lm(formula = log(SALE.PRICE) ~ log(GROSS.SQUARE.FEET), data = bronx)

Residuals:
     Min       1Q   Median       3Q      Max
-14.4529   0.0377   0.4160   0.6572   3.8159

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)              7.0271     0.3088   22.75   <2e-16 ***
log(GROSS.SQUARE.FEET)   0.7013     0.0379   18.50   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.95 on 2435 degrees of freedom
Multiple R-squared: 0.1233, Adjusted R-squared: 0.1229
F-statistic: 342.4 on 1 and 2435 DF, p-value: < 2.2e-16
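What does the slope mean here? In a log-log fit it is an elasticity: a 1% increase in gross square feet is associated with roughly a 0.70% increase in sale price. A couple of base-R helpers for digging further (a sketch; m1 is the model above):
confint(m1) # 95% confidence intervals for the intercept and the elasticity
summary(m1)$r.squared # the R-squared reported above (~0.12)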
Plots – tell me what they tell you!
Solution model 2
> m2 <- lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD), data=bronx)
> summary(m2)
> plot(resid(m2))
#
> m2a <- lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD), data=bronx)
> summary(m2a)
> plot(resid(m2a))
How do you interpret this residual plot?
Solution models 3 and 4
> m3 <- lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD)+factor(bronx$BUILDING.CLASS.CATEGORY), data=bronx)
> summary(m3)
> plot(resid(m3))
#
> m4 <- lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD)*factor(bronx$BUILDING.CLASS.CATEGORY), data=bronx)
> summary(m4)
> plot(resid(m4))
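Since m3 is nested inside m4, a quick way to ask whether the neighborhood-by-building-class interaction earns its keep (a sketch, using base R on the models above):
anova(m3, m4) # F-test: does the interaction significantly improve the fit?
AIC(m3, m4)   # information criteria; lower is better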
And this one?
bronx1$SALE.PRICE <- sub("\\$", "", bronx1$SALE.PRICE)
bronx1$SALE.PRICE <- as.numeric(gsub(",", "", bronx1$SALE.PRICE))
bronx1$GROSS.SQUARE.FEET <- as.numeric(gsub(",", "", bronx1$GROSS.SQUARE.FEET))
bronx1$LAND.SQUARE.FEET <- as.numeric(gsub(",", "", bronx1$LAND.SQUARE.FEET))
bronx1$SALE.DATE <- as.Date(gsub("[^[:digit:]]", "", bronx1$SALE.DATE))
bronx1$YEAR.BUILT <- as.numeric(gsub("[^[:digit:]]", "", bronx1$YEAR.BUILT))
bronx1$ZIP.CODE <- as.character(gsub("[^[:digit:]]", "", bronx1$ZIP.CODE))
minprice <- 10000
bronx1 <- bronx1[which(bronx1$SALE.PRICE >= minprice), ]
nval <- dim(bronx1)[1]
bronx1$ADDRESSONLY <- gsub("[,][[:print:]]*", "", gsub("[ ]+", " ", trim(bronx1$ADDRESS)))
bronxadd <- unique(data.frame(bronx1$ADDRESSONLY, bronx1$ZIP.CODE, stringsAsFactors=FALSE))
names(bronxadd) <- c("ADDRESSONLY", "ZIP.CODE")
bronxadd <- bronxadd[order(bronxadd$ADDRESSONLY), ]
duplicates <- duplicated(bronx1$ADDRESSONLY)
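Note that trim() is not a base-R function – the lab presumably loaded it from a package (gdata has one) or defined it by hand. A minimal stand-in, for completeness (base R >= 3.2 also provides trimws()):
trim <- function(x) gsub("^\\s+|\\s+$", "", x) # strip leading/trailing whitespace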
matched <- (querylist$lat != 0 & querylist$lon != 0)
addsample <- cbind(addsample, querylist$lat, querylist$lon)
names(addsample) <- c("ADDRESSONLY", "ZIPCODE", "Latitude", "Longitude") # correct the column names
adduse <- merge(bronx1, addsample)
adduse <- adduse[!is.na(adduse$Latitude), ]
mapcoord <- adduse[, c(2, 3, 24, 25)]
table(mapcoord$NEIGHBORHOOD)
mapcoord$NEIGHBORHOOD <- as.factor(mapcoord$NEIGHBORHOOD)
map <- get_map(location = 'Bronx', zoom = 12) # zoom 11 or 12
ggmap(map) + geom_point(aes(x = mapcoord$Longitude, y = mapcoord$Latitude, size = 1, color = mapcoord$NEIGHBORHOOD), data = mapcoord) + theme(legend.position = "none")
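For reference, the querylist of coordinates used above comes from geocoding the unique addresses. A sketch with ggmap's geocode() – note that recent ggmap versions require API registration, and the sample size here is illustrative:
library(ggmap)
addsample <- bronxadd[sample.int(nrow(bronxadd), 450), ] # geocoding everything is slow; 450 is illustrative
querylist <- geocode(paste(addsample$ADDRESSONLY, "BRONX", "NY", sep = ", ")) # returns lon/lat columns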
Did you get to create the neighborhood map?
table(mapcoord$NEIGHBORHOOD)
mapcoord$NEIGHBORHOOD <- as.factor(mapcoord$NEIGHBORHOOD)
The MAP!!
mapmeans <- cbind(adduse, as.numeric(mapcoord$NEIGHBORHOOD))
colnames(mapmeans)[26] <- "NEIGHBORHOOD" # this is the right way to rename
keeps <- c("ZIP.CODE", "NEIGHBORHOOD", "TOTAL.UNITS", "LAND.SQUARE.FEET", "GROSS.SQUARE.FEET", "SALE.PRICE", "Latitude", "Longitude")
mapmeans <- mapmeans[keeps] # drop the others
mapmeans$NEIGHBORHOOD <- as.numeric(mapcoord$NEIGHBORHOOD)
for (i in 1:8) { mapmeans[, i] = as.numeric(mapmeans[, i]) } # convert all columns to numeric
# Classification
mapcoord$class <- as.numeric(mapcoord$NEIGHBORHOOD)
nclass <- dim(mapcoord)[1]
split <- 0.8
trainid <- sample.int(nclass, floor(split * nclass))
testid <- (1:nclass)[-trainid]
KNN! Did you loop over k?
{
  knnpred <- knn(mapcoord[trainid, 3:4], mapcoord[testid, 3:4], cl = mapcoord[trainid, 2], k = 5)
  knntesterr <- sum(knnpred != mapcoord[testid, 2]) / length(testid)
}
knntesterr
[1] 0.1028037 0.1308411 0.1308411 0.1588785 0.1401869 0.1495327 0.1682243 0.1962617 0.1962617 0.1869159
What do you think?
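A sketch of the loop over k the slide asks about, assuming mapcoord, trainid and testid from the previous slide, with the features in columns 3:4 and the class label in column 2:
library(class)
knntesterr <- rep(0, 10)
for (k in 1:10) {
  knnpred <- knn(mapcoord[trainid, 3:4], mapcoord[testid, 3:4], cl = mapcoord[trainid, 2], k = k)
  knntesterr[k] <- sum(knnpred != mapcoord[testid, 2]) / length(testid)
}
knntesterr # ten test-error rates, one per k, as listed above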
Try these on mapmeans, etc.
K-Means!
> mapmeans <- data.frame(adduse$ZIP.CODE, as.numeric(mapcoord$NEIGHBORHOOD), adduse$TOTAL.UNITS, adduse$LAND.SQUARE.FEET, adduse$GROSS.SQUARE.FEET, adduse$SALE.PRICE, adduse$'querylist$latitude', adduse$'querylist$longitude')
> mapobj <- kmeans(mapmeans, 5, iter.max=10, nstart=5, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")) # given a vector, kmeans() uses the first algorithm
> fitted(mapobj, method=c("centers","classes"))
> mapobj$centers
  ZIP.CODE  NEIGHBORHOOD  TOTAL.UNITS  LAND.SQUARE.FEET  GROSS.SQUARE.FEET  SALE.PRICE   latitude  longitude
1 10464.09      19.47454     1.550926          2028.285           1712.887    279950.4   40.85280  -73.87357
2 10460.65      16.38710    25.419355         11077.419          26793.516   2944099.9   40.85597  -73.89139
3 10454.00      20.00000     1.000000         29000.000          87000.000  24120881.0   40.80441  -73.92290
4 10463.45      10.90909    42.181818         10462.273          40476.636   6953345.4   40.86009  -73.88632
5 10464.00      17.42857     4.714286         14042.214           9757.679    885950.9   40.85300  -73.87781
> plot(mapmeans, col = mapobj$cluster) # pairs plot of ZIP.CODE, NEIGHBORHOOD, TOTAL.UNITS, LAND.SQUARE.FEET, GROSS.SQUARE.FEET, SALE.PRICE, latitude, longitude, coloured by cluster
> mapobj$size
[1] 432 31 1 11 56
Return object
cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
centers: A matrix of cluster centres.
totss: The total sum of squares.
withinss: Vector of within-cluster sum of squares, one component per cluster.
tot.withinss: Total within-cluster sum of squares, i.e., sum(withinss).
betweenss: The between-cluster sum of squares, i.e., totss - tot.withinss.
size: The number of points in each cluster.
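These pieces are what an "elbow" plot is built from: total within-cluster sum of squares against k, a common heuristic for choosing the number of clusters. A sketch, assuming mapmeans as above:
wss <- sapply(1:10, function(k) kmeans(mapmeans, centers = k, iter.max = 10, nstart = 5)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "number of clusters k", ylab = "total within-cluster SS") # look for the bend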
Plotting clusters
library(cluster)
clusplot(mapmeans, mapobj$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
# centroid plot against first two discriminant functions
library(fpc)
plotcluster(mapmeans, mapobj$cluster)
Plotting clusters
require(cluster)
clusplot(mapmeans, mapobj$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
Simpler K-Means!
> mapmeans <- data.frame(as.numeric(mapcoord$NEIGHBORHOOD), adduse$GROSS.SQUARE.FEET, adduse$SALE.PRICE, adduse$'querylist$latitude', adduse$'querylist$longitude')
> mapobjnew <- kmeans(mapmeans, 5, iter.max=10, nstart=5, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"))
> fitted(mapobjnew, method=c("centers","classes"))
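Caution: SALE.PRICE is orders of magnitude larger than the other columns, so unscaled k-means is driven almost entirely by price. The usual fix is to standardize first; a minimal sketch:
mapobjscaled <- kmeans(scale(mapmeans), 5, iter.max = 10, nstart = 5) # scale() gives each column mean 0, sd 1
mapobjscaled$size # compare with mapobj$size above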
Plot
Clusplot (k=17)
Dendrogram for this = tree of the clusters:
Highly supported by data? Okay, this is a little complex – perhaps something simpler?
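One simpler alternative: hierarchical clustering on the (scaled) mapmeans, cutting the dendrogram at a chosen k. A sketch, assuming mapmeans from the earlier slides:
hmap <- hclust(dist(scale(mapmeans)))
plot(hmap) # the dendrogram
groups <- cutree(hmap, k = 5)
table(groups) # cluster sizes, comparable to mapobj$size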
What else could you cluster/classify?
SALE.PRICE?
– If so, how would you measure error?
# I added SALE.PRICE as the 5th column in adduse…
> pcolor <- color.scale(log(mapcoord[,5]), c(0,1,1), c(1,1,0), 0)
> geoPlot(mapcoord, zoom=12, color=pcolor)
TAX.CLASS.AT.PRESENT? TAX.CLASS.AT.TIME.OF.SALE? How would you measure error?
Trees for the NYC housing dataset?
Could you now flip over to a tree-based method for this dataset? What might you expect?
Combination of clusters and trees?
– Hierarchical clustering!
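As a starting point, a sketch of one tree-based take – rpart is an assumption here (party/ctree, used later with the swiss data, would work too), applied to the filtered bronx data from the regression slides:
library(rpart)
fitTree <- rpart(log(SALE.PRICE) ~ log(GROSS.SQUARE.FEET) + log(LAND.SQUARE.FEET) + NEIGHBORHOOD, data = bronx)
plot(fitTree, uniform = TRUE)
text(fitTree, cex = 0.7) # label the splits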
Cluster plotting
source("http://www.r-statistics.com/wp-content/uploads/2012/01/source_https.r.txt") # source code from github
require(RCurl)
require(colorspace)
source_https("https://raw.github.com/talgalili/R-code-snippets/master/clustergram.r")
data(iris)
set.seed(250)
par(cex.lab = 1.5, cex.main = 1.2)
Data <- scale(iris[,-5]) # scaling
clustergram(Data, k.range = 2:8, line.width = 0.004) # line.width – adjust according to Y-scale
Clustergram
Any good?
set.seed(500)
Data2 <- scale(iris[,-5])
par(cex.lab = 1.2, cex.main = .7)
par(mfrow = c(3,2))
for (i in 1:6) clustergram(Data2, k.range = 2:8, line.width = .004, add.center.points = T)
How can you tell it is good?
set.seed(250)
Data <- rbind(cbind(rnorm(100,0, sd = 0.3), rnorm(100,0, sd = 0.3), rnorm(100,0, sd = 0.3)),
              cbind(rnorm(100,1, sd = 0.3), rnorm(100,1, sd = 0.3), rnorm(100,1, sd = 0.3)),
              cbind(rnorm(100,2, sd = 0.3), rnorm(100,2, sd = 0.3), rnorm(100,2, sd = 0.3)))
clustergram(Data, k.range = 2:5, line.width = .004, add.center.points = T)
More complex…
set.seed(250)
Data <- rbind(cbind(rnorm(100,1, sd = 0.3), rnorm(100,0, sd = 0.3), rnorm(100,0, sd = 0.3), rnorm(100,0, sd = 0.3)),
              cbind(rnorm(100,0, sd = 0.3), rnorm(100,1, sd = 0.3), rnorm(100,0, sd = 0.3), rnorm(100,0, sd = 0.3)),
              cbind(rnorm(100,0, sd = 0.3), rnorm(100,1, sd = 0.3), rnorm(100,1, sd = 0.3), rnorm(100,0, sd = 0.3)),
              cbind(rnorm(100,0, sd = 0.3), rnorm(100,0, sd = 0.3), rnorm(100,0, sd = 0.3), rnorm(100,1, sd = 0.3)))
clustergram(Data, k.range = 2:8, line.width = .004, add.center.points = T)
Look at the location of the cluster points on the Y axis: see when they remain stable, when they start flying around, and what happens to them at higher numbers of clusters (do they re-group together?).
Observe the strands of the data points. Even if the cluster centers are not ordered, the lines for each item might tend to move together (this needs more research and thinking) – hinting at the real number of clusters.
Run the plot multiple times to observe the stability of the cluster formation (and location).
http://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/
Plotting clusters (DIY)
library(cluster)
clusplot(mapmeans, mapobj$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
# centroid plot against first two discriminant functions
# library(fpc)
plotcluster(mapmeans, mapobj$cluster)
Dendrogram?
library(fpc)
cluster.stats
Another example
> A = c(1, 2.5); B = c(5, 10); C = c(23, 34)
> D = c(45, 47); E = c(4, 17); F = c(18, 4)
> df <- data.frame(rbind(A,B,C,D,E,F))
> colnames(df) <- c("x","y")
> hc <- hclust(dist(df))
> plot(hc)
> df$cluster <- cutree(hc, k=2) # 2 clusters
> plot(y~x, df, col=df$cluster) # colour the points by cluster membership
Swiss – pairs
pairs(~ Fertility + Education + Catholic, data = swiss, subset = Education < 20, main = "Swiss data, Education < 20")
ctree
require(party)
swiss_ctree <- ctree(Fertility ~ Agriculture + Education + Catholic, data = swiss)
plot(swiss_ctree)
ctree
require(party)
swiss_ctree <- ctree(Fertility ~ Agriculture + Education + Catholic + Examination, data = swiss)
plot(swiss_ctree)
Hierarchical clustering
> dswiss <- dist(as.matrix(swiss))
> hs <- hclust(dswiss)
> plot(hs)
Clustering (kmeans) – swiss
Or Bayes? Discuss…
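A sketch for the discussion: k-means on the scaled swiss data (k = 3 is an assumption; try others):
sswiss <- scale(swiss) # standardize, since the columns are on different scales
swissk <- kmeans(sswiss, centers = 3, nstart = 10)
table(swissk$cluster)
pairs(swiss, col = swissk$cluster) # scatterplot matrix coloured by cluster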
Classification – Bayes
Retrieve the abalone.csv dataset.
Predicting the age of abalone from physical measurements.
Perform naïve Bayes classification to get predictors for Age (Rings).
Compare to what you got from kknn (weighted k-nearest neighbors) in class 4b.
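A minimal sketch of the exercise, assuming abalone.csv follows the UCI column layout (with a Rings column) and using e1071 for naïve Bayes; the three age bins are an assumption:
library(e1071)
abalone <- read.csv("abalone.csv")
abalone$AgeGroup <- cut(abalone$Rings, breaks = c(0, 8, 11, 35), labels = c("young", "adult", "old")) # illustrative bins
nb <- naiveBayes(AgeGroup ~ ., data = abalone[, names(abalone) != "Rings"]) # drop Rings itself from the predictors
pred <- predict(nb, abalone)
table(pred, abalone$AgeGroup) # confusion matrix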
Hair, eye color
> data(HairEyeColor)
> mosaicplot(HairEyeColor)
> margin.table(HairEyeColor, 3)
Sex
  Male Female
   279    313
> margin.table(HairEyeColor, c(1,3)) # gives you – what?
       Sex
Hair    Male Female
  Black   56     52
  Brown  143    143
  Red     34     37
  Blond   46     81
Construct a naïve Bayes classifier to predict "Sex" from the other two variables and test it!
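One way to tackle this (a sketch, using e1071): expand the contingency table into one row per person, then fit and test on the same data:
library(e1071)
hec <- as.data.frame(HairEyeColor) # columns Hair, Eye, Sex, Freq
hec <- hec[rep(seq_len(nrow(hec)), hec$Freq), 1:3] # one row per observation
nbhec <- naiveBayes(Sex ~ Hair + Eye, data = hec)
table(predict(nbhec, hec), hec$Sex) # confusion matrix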
nbayes1
> table(pred, HouseVotes84$Class)
pred         democrat republican
  democrat        238         13
  republican       29        155
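The table above comes from fitting naïve Bayes to the HouseVotes84 data; a sketch close to the e1071 documentation example (HouseVotes84 ships with mlbench):
library(e1071)
data(HouseVotes84, package = "mlbench") # 1984 US House votes
model <- naiveBayes(Class ~ ., data = HouseVotes84)
pred <- predict(model, HouseVotes84)
table(pred, HouseVotes84$Class)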
> predict(model, HouseVotes84[1:10, -1], type = "raw")
          democrat   republican
 [1,] 1.029209e-07 9.999999e-01
 [2,] 5.820415e-08 9.999999e-01
 [3,] 5.684937e-03 9.943151e-01
 [4,] 9.985798e-01 1.420152e-03
 [5,] 9.666720e-01 3.332802e-02
 [6,] 8.121430e-01 1.878570e-01
 [7,] 1.751512e-04 9.998248e-01
 [8,] 8.300100e-06 9.999917e-01
 [9,] 8.277705e-08 9.999999e-01
[10,] 1.000000e+00 5.029425e-11
See also
Lab6a_kmeans_bayes_2016.R
– Try clustergram
– Try hclust
– ctree
Lab3b_kmeans1_2016.R
– Try clustergram
– Try hclust
– ctree