Peter Fox Data Analytics – ITWS-4600/ITWS-6600 Week 6b, March 4, 2016 Interpretation: Regression, Clustering (plotting), Clustergrams, Trees and Hierarchies…

Assignment 6 preview
Your term project should fall within the scope of a data analytics problem of the type you have worked with in class/labs, or know of yourself – the bigger the data the better. This means the work must go beyond just making lots of figures. Develop the project to show that you are thinking about and exploring the relationships and distributions within your data. Start with a hypothesis, think of a way to model and test that hypothesis, find or collect the necessary data, and do preliminary analysis, detailed modeling, and a summary (interpretation).
–Note: You do not have to come up with a positive result; disproving the hypothesis is just as good.
Please use the section numbering below for your written submission for this assignment.
Introduction (2%)
Data Description (3%)
Analysis (8%)
Model Development (8%)
Conclusions and Discussion (4%)
Oral presentation (5%) (10 mins)

Contents

Regression
> plot(log(bronx$GROSS.SQUARE.FEET), log(bronx$SALE.PRICE))
> m1 <- lm(log(SALE.PRICE)~log(GROSS.SQUARE.FEET), data=bronx)
You were reminded that log(0) is … not fun. THINK through what you are doing… Filtering is somewhat inevitable:
> bronx <- bronx[which(bronx$GROSS.SQUARE.FEET>0 & bronx$LAND.SQUARE.FEET>0 & bronx$SALE.PRICE>0),]
> m1 <- lm(log(SALE.PRICE)~log(GROSS.SQUARE.FEET), data=bronx)
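Before filtering, it is worth counting how many rows would blow up the log transform – a minimal check, assuming the unfiltered bronx frame above:
sum(bronx$GROSS.SQUARE.FEET == 0 | bronx$SALE.PRICE == 0, na.rm = TRUE)  # rows log() would send to -Inf
log(0)  # returns -Inf, and lm() stops with an NA/NaN/Inf error if that reaches the response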

Interpreting this!
Call:
lm(formula = log(SALE.PRICE) ~ log(GROSS.SQUARE.FEET), data = bronx)
Residuals: (Min / 1Q / Median / 3Q / Max values lost in transcription)
Coefficients:
(Intercept) and log(GROSS.SQUARE.FEET): estimates and standard errors lost in transcription; both have Pr(>|t|) < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.95 on 2435 degrees of freedom
Multiple R-squared and Adjusted R-squared: values lost in transcription
F-statistic: (value lost) on 1 and 2435 DF, p-value: < 2.2e-16
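If the printed block is hard to read, the same quantities can be pulled out programmatically – a minimal sketch against the m1 fit above, using the standard summary.lm components:
s <- summary(m1)
coef(s)           # estimate, std. error, t value and Pr(>|t|) for each term
s$r.squared       # multiple R-squared
s$adj.r.squared   # adjusted R-squared
s$sigma           # residual standard error (the 1.95 above)
s$df[2]           # residual degrees of freedom (the 2435 above)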

Plots – tell me what they tell you!

Solution model 2
> m2 <- lm(log(SALE.PRICE)~log(GROSS.SQUARE.FEET)+log(LAND.SQUARE.FEET)+factor(NEIGHBORHOOD), data=bronx)
> summary(m2)
> plot(resid(m2))
# m2a drops the intercept (the 0+ term), so each neighborhood gets its own baseline level
> m2a <- lm(log(SALE.PRICE)~0+log(GROSS.SQUARE.FEET)+log(LAND.SQUARE.FEET)+factor(NEIGHBORHOOD), data=bronx)
> summary(m2a)
> plot(resid(m2a))

How do you interpret this residual plot?

Solution model 3 and 4
> m3 <- lm(log(SALE.PRICE)~0+log(GROSS.SQUARE.FEET)+log(LAND.SQUARE.FEET)+factor(NEIGHBORHOOD)+factor(BUILDING.CLASS.CATEGORY), data=bronx)
> summary(m3)
> plot(resid(m3))
# m4 replaces + with *, so neighborhood and building class interact
> m4 <- lm(log(SALE.PRICE)~0+log(GROSS.SQUARE.FEET)+log(LAND.SQUARE.FEET)+factor(NEIGHBORHOOD)*factor(BUILDING.CLASS.CATEGORY), data=bronx)
> summary(m4)
> plot(resid(m4))
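Since m3 (additive) is nested in m4 (interaction), a standard way to judge whether the interaction earns its extra parameters – a short sketch using the fits above:
anova(m3, m4)    # F-test for the added interaction terms
AIC(m2, m3, m4)  # penalized comparison across the candidate models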

And this one?


bronx1$SALE.PRICE <- sub("\\$","",bronx1$SALE.PRICE)
bronx1$SALE.PRICE <- as.numeric(gsub(",","", bronx1$SALE.PRICE))
bronx1$GROSS.SQUARE.FEET <- as.numeric(gsub(",","", bronx1$GROSS.SQUARE.FEET))
bronx1$LAND.SQUARE.FEET <- as.numeric(gsub(",","", bronx1$LAND.SQUARE.FEET))
bronx1$SALE.DATE <- as.Date(gsub("[^[:digit:]]","",bronx1$SALE.DATE))
bronx1$YEAR.BUILT <- as.numeric(gsub("[^[:digit:]]","",bronx1$YEAR.BUILT))
bronx1$ZIP.CODE <- as.character(gsub("[^[:digit:]]","",bronx1$ZIP.CODE))
minprice <- 10000  # the threshold value was lost in transcription; 10000 is assumed
bronx1 <- bronx1[which(bronx1$SALE.PRICE >= minprice),]
nval <- dim(bronx1)[1]
# trim() is a whitespace-trimming helper defined in the lab, not base R
bronx1$ADDRESSONLY <- gsub("[,][[:print:]]*","",gsub("[ ]+","",trim(bronx1$ADDRESS)))
bronxadd <- unique(data.frame(bronx1$ADDRESSONLY, bronx1$ZIP.CODE, stringsAsFactors=FALSE))
names(bronxadd) <- c("ADDRESSONLY","ZIP.CODE")
bronxadd <- bronxadd[order(bronxadd$ADDRESSONLY),]
duplicates <- duplicated(bronx1$ADDRESSONLY)

matched <- (querylist$lat!=0 & querylist$lon!=0)
addsample <- cbind(addsample, querylist$lat, querylist$lon)
names(addsample) <- c("ADDRESSONLY","ZIPCODE","Latitude","Longitude") # correct the column names
adduse <- merge(bronx1, addsample)
adduse <- adduse[!is.na(adduse$Latitude),]
mapcoord <- adduse[,c(2,3,24,25)]
table(mapcoord$NEIGHBORHOOD)
mapcoord$NEIGHBORHOOD <- as.factor(mapcoord$NEIGHBORHOOD)
map <- get_map(location = 'Bronx', zoom = 12) # zoom 11 or 12
ggmap(map) + geom_point(aes(x = mapcoord$Longitude, y = mapcoord$Latitude, size = 1, color = mapcoord$NEIGHBORHOOD), data = mapcoord) + theme(legend.position = "none")

Did you get to create the neighborhood map?
table(mapcoord$NEIGHBORHOOD)
mapcoord$NEIGHBORHOOD <- as.factor(mapcoord$NEIGHBORHOOD)
The MAP!!

mapmeans <- cbind(adduse, as.numeric(mapcoord$NEIGHBORHOOD))
colnames(mapmeans)[26] <- "NEIGHBORHOOD" # this is the right way of renaming
keeps <- c("ZIP.CODE","NEIGHBORHOOD","TOTAL.UNITS","LAND.SQUARE.FEET","GROSS.SQUARE.FEET","SALE.PRICE","Latitude","Longitude")
mapmeans <- mapmeans[keeps] # dropping the other columns
mapmeans$NEIGHBORHOOD <- as.numeric(mapcoord$NEIGHBORHOOD)
for(i in 1:8){
  mapmeans[,i] <- as.numeric(mapmeans[,i])
} # all eight columns converted to numeric

#Classification
mapcoord$class <- as.numeric(mapcoord$NEIGHBORHOOD)
nclass <- dim(mapcoord)[1]
split <- 0.8
trainid <- sample.int(nclass, floor(split*nclass))
testid <- (1:nclass)[-trainid]

KNN! Did you loop over k?
require(class) # knn() lives in the class package
{
knnpred <- knn(mapcoord[trainid,3:4], mapcoord[testid,3:4], cl=mapcoord[trainid,2], k=5)
knntesterr <- sum(knnpred != mapcoord[testid,2])/length(testid) # mappred$class in the transcript; these are the held-out labels
}
knntesterr
[1] (error-rate value lost in transcription)
What do you think?
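The braces above are the loop body; a minimal sketch of the full loop over k, assuming the train/test split from the previous slide:
require(class)
for (k in 1:25) {
  knnpred <- knn(mapcoord[trainid,3:4], mapcoord[testid,3:4], cl=mapcoord[trainid,2], k=k)
  knntesterr <- sum(knnpred != mapcoord[testid,2])/length(testid)
  cat(k, knntesterr, "\n")  # watch where the test error bottoms out
}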

Try these on mapmeans, etc.

K-Means!
> mapmeans <- data.frame(adduse$ZIP.CODE, as.numeric(mapcoord$NEIGHBORHOOD), adduse$TOTAL.UNITS, adduse$LAND.SQUARE.FEET, adduse$GROSS.SQUARE.FEET, adduse$SALE.PRICE, adduse$'querylist$latitude', adduse$'querylist$longitude')
> mapobj <- kmeans(mapmeans, 5, iter.max=10, nstart=5, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")) # only the first algorithm in the vector is actually used
> fitted(mapobj, method = c("centers","classes")) # likewise, "centers" is used
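One caution before trusting those clusters: k-means uses Euclidean distance, so SALE.PRICE (in dollars) will dominate latitude and longitude. A hedged sketch of standardizing first, using the same mapmeans as above:
mapmeans.s <- scale(mapmeans)  # zero mean, unit variance per column
mapobj.s <- kmeans(mapmeans.s, 5, iter.max=10, nstart=5)
mapobj.s$size                  # compare the cluster sizes with the unscaled fit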

> mapobj$centers
(The 5 × 8 matrix of cluster centres, with columns adduse.ZIP.CODE, as.numeric.mapcoord.NEIGHBORHOOD., adduse.TOTAL.UNITS, adduse.LAND.SQUARE.FEET, adduse.GROSS.SQUARE.FEET, adduse.SALE.PRICE, adduse..querylist.latitude., adduse..querylist.longitude.; the numeric values were lost in transcription.)

> plot(mapmeans, mapobj$cluster)
(Pairs-style plot over ZIP.CODE, NEIGHBORHOOD, TOTAL.UNITS, LAND.SQUARE.FEET, GROSS.SQUARE.FEET, SALE.PRICE, latitude, longitude.)
> mapobj$size
[1] (the five cluster sizes were lost in transcription)

Return object
cluster – A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
centers – A matrix of cluster centres.
totss – The total sum of squares.
withinss – Vector of within-cluster sum of squares, one component per cluster.
tot.withinss – Total within-cluster sum of squares, i.e. sum(withinss).
betweenss – The between-cluster sum of squares, i.e. totss - tot.withinss.
size – The number of points in each cluster.
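tot.withinss is the usual basis for choosing k; a minimal "elbow plot" sketch, assuming the mapmeans frame from earlier:
wss <- sapply(2:15, function(k) kmeans(mapmeans, k, iter.max=10, nstart=5)$tot.withinss)
plot(2:15, wss, type="b", xlab="k", ylab="total within-cluster SS")  # look for the elbow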

Plotting clusters
library(cluster)
clusplot(mapmeans, mapobj$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
# Centroid plot against the first 2 discriminant functions
library(fpc)
plotcluster(mapmeans, mapobj$cluster)

Plotting clusters
require(cluster)
clusplot(mapmeans, mapobj$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)

Simpler K-Means!
> mapmeans <- data.frame(as.numeric(mapcoord$NEIGHBORHOOD), adduse$GROSS.SQUARE.FEET, adduse$SALE.PRICE, adduse$'querylist$latitude', adduse$'querylist$longitude')
> mapobjnew <- kmeans(mapmeans, 5, iter.max=10, nstart=5, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"))
> fitted(mapobjnew, method = c("centers","classes"))

Plot

Clusplot (k=17)

Dendrogram for this = tree of the clusters:
Highly supported by data? Okay, this is a little complex – perhaps something simpler?

What else could you cluster/classify? SALE.PRICE?
–If so, how would you measure error?
# I added SALE.PRICE as the 5th column in adduse…
> pcolor <- color.scale(log(mapcoord[,5]), c(0,1,1), c(1,1,0), 0)
> geoPlot(mapcoord, zoom=12, color=pcolor)
TAX.CLASS.AT.PRESENT? TAX.CLASS.AT.TIME.OF.SALE? How would you measure error?

Trees for the NYC housing dataset? Could you now flip over to a tree-based method for this dataset? What might you expect? A combination of clusters and trees?
–Hierarchical clustering!
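One simple way to combine the two ideas – a hedged sketch that grows a dendrogram over the k-means centres computed earlier (mapobj from the K-Means slide):
hc <- hclust(dist(mapobj$centers))  # tree over the 5 cluster centres
plot(hc, main="k-means centres, clustered hierarchically")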

Cluster plotting
# both source URLs were truncated in the transcript; the ones below are assumed
# from Tal Galili's clustergram post on r-statistics.com
source("https://www.r-statistics.com/wp-content/uploads/2012/01/source_https.r.txt") # source code from github
require(RCurl)
require(colorspace)
source_https("https://raw.github.com/talgalili/R-code-snippets/master/clustergram.r")
data(iris)
set.seed(250)
par(cex.lab = 1.5, cex.main = 1.2)
Data <- scale(iris[,-5]) # scaling
clustergram(Data, k.range = 2:8, line.width = 0.004) # line.width – adjust according to Y-scale

Clustergram

Any good?
set.seed(500)
Data2 <- scale(iris[,-5])
par(cex.lab = 1.2, cex.main = 0.7)
par(mfrow = c(3,2))
for(i in 1:6) clustergram(Data2, k.range = 2:8, line.width = 0.004, add.center.points = TRUE)


How can you tell it is good?
set.seed(250)
Data <- rbind(
  cbind(rnorm(100,0, sd = 0.3), rnorm(100,0, sd = 0.3), rnorm(100,0, sd = 0.3)),
  cbind(rnorm(100,1, sd = 0.3), rnorm(100,1, sd = 0.3), rnorm(100,1, sd = 0.3)),
  cbind(rnorm(100,2, sd = 0.3), rnorm(100,2, sd = 0.3), rnorm(100,2, sd = 0.3)))
# three well-separated Gaussian clusters in 3-D
clustergram(Data, k.range = 2:5, line.width = 0.004, add.center.points = TRUE)

More complex…
set.seed(250)
Data <- rbind(
  cbind(rnorm(100,1, sd = 0.3), rnorm(100,0, sd = 0.3), rnorm(100,0, sd = 0.3), rnorm(100,0, sd = 0.3)),
  cbind(rnorm(100,0, sd = 0.3), rnorm(100,1, sd = 0.3), rnorm(100,0, sd = 0.3), rnorm(100,0, sd = 0.3)),
  cbind(rnorm(100,0, sd = 0.3), rnorm(100,1, sd = 0.3), rnorm(100,1, sd = 0.3), rnorm(100,0, sd = 0.3)),
  cbind(rnorm(100,0, sd = 0.3), rnorm(100,0, sd = 0.3), rnorm(100,0, sd = 0.3), rnorm(100,1, sd = 0.3)))
# four overlapping clusters in 4-D
clustergram(Data, k.range = 2:8, line.width = 0.004, add.center.points = TRUE)

Look at the location of the cluster points on the Y axis: see when they remain stable, when they start flying around, and what happens to them at higher numbers of clusters (do they re-group together?).
Observe the strands of the data points. Even if the cluster centers are not ordered, the lines for each item might tend to move together (this needs more research and thinking) – hinting at the real number of clusters.
Run the plot multiple times to observe the stability of the cluster formation (and location).

Plotting clusters (DIY)
library(cluster)
clusplot(mapmeans, mapobj$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
# Centroid plot against the first 2 discriminant functions
library(fpc) # needed for plotcluster() and cluster.stats()
plotcluster(mapmeans, mapobj$cluster)
Dendrogram? See also cluster.stats() in fpc.

Another example
> A = c(1, 2.5); B = c(5, 10); C = c(23, 34)
> D = c(45, 47); E = c(4, 17); F = c(18, 4)
> df <- data.frame(rbind(A,B,C,D,E,F))
> colnames(df) <- c("x","y")
> hc <- hclust(dist(df))
> plot(hc)
> df$cluster <- cutree(hc, k=2) # 2 clusters
> plot(y~x, data=df, col=df$cluster)



Swiss - pairs
pairs(~ Fertility + Education + Catholic, data = swiss, subset = Education < 20, main = "Swiss data, Education < 20")

ctree
require(party)
swiss_ctree <- ctree(Fertility ~ Agriculture + Education + Catholic, data = swiss)
plot(swiss_ctree)

ctree
require(party)
swiss_ctree <- ctree(Fertility ~ Agriculture + Education + Catholic + Examination, data = swiss)
plot(swiss_ctree)

Hierarchical clustering
> dswiss <- dist(as.matrix(swiss))
> hs <- hclust(dswiss)
> plot(hs)

Clustering (kmeans) - swiss. Or Bayes? Discuss…
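A minimal sketch of the k-means half of that question (built-in datasets::swiss; k = 3 is an arbitrary assumption):
sw <- scale(swiss)                 # the swiss variables are on different scales
swobj <- kmeans(sw, 3, nstart=10)
table(swobj$cluster)               # cluster sizes
pairs(swiss, col=swobj$cluster)    # clusters shown in the original variables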

Classification: Bayes
Retrieve the abalone.csv dataset: predicting the age of abalone from physical measurements.
Perform naïve Bayes classification to get predictors for Age (Rings).
Compare to what you got from kknn (weighted k-nearest neighbours) in class 4b.
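A hedged sketch of that exercise with e1071::naiveBayes (the abalone.csv path and column names are assumptions based on the standard UCI abalone layout):
require(e1071)
abalone <- read.csv("abalone.csv")           # path assumed
abalone$Rings <- as.factor(abalone$Rings)    # naiveBayes wants a factor response
nb <- naiveBayes(Rings ~ ., data = abalone)
pred <- predict(nb, abalone)
table(pred, abalone$Rings)                   # confusion matrix; compare with the kknn results from class 4b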

Hair, eye color
> data(HairEyeColor)
> mosaicplot(HairEyeColor)
> margin.table(HairEyeColor, 3)       # totals by Sex (Male, Female)
> margin.table(HairEyeColor, c(1,3))  # gives you – what? Hair (Black, Brown, Red, Blond) by Sex
(The counts in both tables were lost in transcription.)
Construct a naïve Bayes classifier to predict “Sex” from the other two variables and test it!
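A minimal sketch of that exercise (e1071 again), expanding the 3-way contingency table into one row per person before fitting:
require(e1071)
hec <- as.data.frame(HairEyeColor)            # columns Hair, Eye, Sex, Freq
hec <- hec[rep(1:nrow(hec), hec$Freq), 1:3]   # one row per observed person
nb <- naiveBayes(Sex ~ Hair + Eye, data = hec)
table(predict(nb, hec), hec$Sex)              # confusion matrix on the training data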

nbayes1
> table(pred, HouseVotes84$Class)
(Confusion matrix, predicted vs. actual party, rows and columns democrat/republican; the counts were lost in transcription.)

> predict(model, HouseVotes84[1:10,-1], type = "raw")
(A 10 × 2 matrix of raw posterior probabilities, columns democrat and republican; the numeric values were lost in transcription.)

See also
Lab6a_kmeans_bayes_2016.R
–Try clustergram
–Try hclust
–Try ctree
Lab3b_kmeans1_2016.R
–Try clustergram
–Try hclust
–Try ctree