1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4b, February 14, 2014 Lab exercises: regression, kNN and K-means.




Today
Linear regression
K Nearest Neighbors
K Means

The Dataset(s)
Some new ones: nyt/, sales/ and fb100/ (.mat files).
Three scripts (fragments, i.e. they will not run as-is, I think) to help with code for today: Lab4b_{1,2,3}.R

Linear and least-squares
> EPI_data <- read.csv("EPI_data.csv")
> attach(EPI_data)
> boxplot(ENVHEALTH, DALY, AIR_H, WATER_H)
> lmENVH <- lm(ENVHEALTH ~ DALY + AIR_H + WATER_H)
> lmENVH # … (what should you get?)
> summary(lmENVH) # …
> cENVH <- coef(lmENVH)

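Predict
> DALYNEW <- c(seq(5, 95, 5))
> AIR_HNEW <- c(seq(5, 95, 5))
> WATER_HNEW <- c(seq(5, 95, 5))
> # column names must match the predictors in lmENVH for predict() to use them
> NEW <- data.frame(DALY = DALYNEW, AIR_H = AIR_HNEW, WATER_H = WATER_HNEW)
> pENV <- predict(lmENVH, NEW, interval = "prediction")
> cENV <- predict(lmENVH, NEW, interval = "confidence")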
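
The prediction step depends on the new data frame carrying the same column names as the predictors in the model formula. A minimal sketch of that idea, using the built-in mtcars data as a stand-in for the EPI data (the variables wt and hp are illustrative choices, not part of the lab):

```r
# Fit a multiple regression on a built-in dataset (stand-in for EPI_data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# newdata columns must carry the predictor names used in the formula
new <- data.frame(wt = seq(2, 4, 0.5), hp = seq(100, 180, 20))

p_pred <- predict(fit, new, interval = "prediction")  # interval for a new observation
p_conf <- predict(fit, new, interval = "confidence")  # interval for the mean response

# Both are matrices with columns fit, lwr, upr; prediction intervals are the wider ones
head(p_pred)
head(p_conf)
```

The point estimates in the fit column are identical in both results; only the interval widths differ.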

Repeat for AIR_E, CLIMATE

Remember a few useful cmds
head( )
tail( )
summary( )

K Nearest Neighbors (classification)
> library(class) # provides knn()
> nyt1 <- read.csv("nyt1.csv")
> # the first condition was lost in this transcript; Impressions is assumed here
> nyt1 <- nyt1[which(nyt1$Impressions > 0 & nyt1$Clicks > 0 & nyt1$Age > 0), ]
> nnyt1 <- dim(nyt1)[1] # shrink it down!
> sampling.rate = 0.9
> num.test.set.labels = nnyt1 * (1. - sampling.rate)
> training <- sample(1:nnyt1, sampling.rate * nnyt1, replace = FALSE)
> train <- subset(nyt1[training, ], select = c(Age, Impressions))
> testing <- setdiff(1:nnyt1, training)
> test <- subset(nyt1[testing, ], select = c(Age, Impressions))
> cg <- nyt1$Gender[training]
> true.labels <- nyt1$Gender[testing]
> classif <- knn(train, test, cg, k = 5)
> classif
> attributes(.Last.value) # interpretation to come!
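
The same train/test pattern can be tried without the nyt1 file. The sketch below assumes the built-in iris data as a stand-in, with two numeric columns as features and Species as the label; the seed and the column choices are arbitrary.

```r
library(class)  # provides knn()

set.seed(1)                       # reproducible split (seed is arbitrary)
n <- nrow(iris)
train_id <- sample(seq_len(n), floor(0.9 * n))
test_id  <- setdiff(seq_len(n), train_id)

train <- iris[train_id, c("Sepal.Length", "Petal.Length")]
test  <- iris[test_id,  c("Sepal.Length", "Petal.Length")]
cl    <- iris$Species[train_id]   # training labels
truth <- iris$Species[test_id]    # held-out labels

pred <- knn(train, test, cl, k = 5)

table(predicted = pred, actual = truth)   # confusion matrix
mean(pred != truth)                       # misclassification rate on the test set
```

Comparing the predictions with the held-out labels is what turns the raw knn() output into something interpretable.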

Regression
> bronx <- read.xlsx("sales/rollingsales_bronx.xls", pattern = "BOROUGH", stringsAsFactors = FALSE, sheetIndex = 1, startRow = 5, header = TRUE)
> plot(log(bronx$GROSS.SQUARE.FEET), log(bronx$SALE.PRICE))
> m1 <- lm(log(bronx$SALE.PRICE) ~ log(bronx$GROSS.SQUARE.FEET), data = bronx)
What's wrong?

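Clean up…
> # the first condition was lost in this transcript; GROSS.SQUARE.FEET is assumed since the model takes its log
> bronx <- bronx[which(bronx$GROSS.SQUARE.FEET > 0 & bronx$LAND.SQUARE.FEET > 0 & bronx$SALE.PRICE > 0), ]
> m1 <- lm(log(bronx$SALE.PRICE) ~ log(bronx$GROSS.SQUARE.FEET), data = bronx)
> summary(m1)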
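
Why the clean-up matters: the log of zero is -Inf, and lm() refuses to fit when the response or a predictor is not finite. A small sketch with made-up numbers (the data frame d and its values are invented for illustration, not taken from the sales data):

```r
# Synthetic stand-in for the sales data: some rows have zero area or zero price
d <- data.frame(area  = c(0, 800, 1200, 1500, 2000, 2500),
                price = c(150000, 0, 300000, 340000, 420000, 500000))

log(0)                                   # -Inf, which lm() cannot use

# lm() stops with an error when the logged columns contain -Inf
bad <- try(lm(log(price) ~ log(area), data = d), silent = TRUE)
inherits(bad, "try-error")

# Keep only strictly positive values, then refit
d_ok <- d[which(d$area > 0 & d$price > 0), ]
m <- lm(log(price) ~ log(area), data = d_ok)
coef(m)
```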

Call:
lm(formula = log(SALE.PRICE) ~ log(GROSS.SQUARE.FEET), data = bronx)
Residuals:
Min 1Q Median 3Q Max
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) <2e-16 ***
log(GROSS.SQUARE.FEET) <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.95 on 2435 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 1 and 2435 DF, p-value: < 2.2e-16

Plot
> plot(log(bronx$GROSS.SQUARE.FEET), log(bronx$SALE.PRICE))
> abline(m1, col = "red", lwd = 2)
# then
> plot(resid(m1))

Another model (2)?
Add two more variables to the linear model: LAND.SQUARE.FEET and NEIGHBORHOOD.
Repeat but suppress the intercept (2a).


Solution model 2
> m2 <- lm(log(bronx$SALE.PRICE) ~ log(bronx$GROSS.SQUARE.FEET) + log(bronx$LAND.SQUARE.FEET) + factor(bronx$NEIGHBORHOOD), data = bronx)
> summary(m2)
> plot(resid(m2))
#
> m2a <- lm(log(bronx$SALE.PRICE) ~ 0 + log(bronx$GROSS.SQUARE.FEET) + log(bronx$LAND.SQUARE.FEET) + factor(bronx$NEIGHBORHOOD), data = bronx)
> summary(m2a)
> plot(resid(m2a))
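
What suppressing the intercept changes when a factor is in the model can be seen on a toy example. The sketch below uses simulated data (the names x, g, y and the seed are invented for illustration):

```r
# Small synthetic example: one numeric predictor and a 3-level factor
set.seed(2)
d <- data.frame(x = rnorm(30), g = factor(rep(c("a", "b", "c"), each = 10)))
d$y <- 1 + 2 * d$x + as.numeric(d$g) + rnorm(30, sd = 0.1)

m_int   <- lm(y ~ x + g, data = d)       # intercept plus contrasts against level "a"
m_noint <- lm(y ~ 0 + x + g, data = d)   # one coefficient per level, no intercept

names(coef(m_int))     # "(Intercept)" "x" "gb" "gc"
names(coef(m_noint))   # "x" "ga" "gb" "gc"

# The fitted values agree; only the parameterisation differs
all.equal(fitted(m_int), fitted(m_noint))
```

Note that summary() computes R-squared differently for a model without an intercept, so the R-squared values of the two fits are not directly comparable even though the fitted values match.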


Solution model 3 and 4
> m3 <- lm(log(bronx$SALE.PRICE) ~ 0 + log(bronx$GROSS.SQUARE.FEET) + log(bronx$LAND.SQUARE.FEET) + factor(bronx$NEIGHBORHOOD) + factor(bronx$BUILDING.CLASS.CATEGORY), data = bronx)
> summary(m3)
> plot(resid(m3))
#
> m4 <- lm(log(bronx$SALE.PRICE) ~ 0 + log(bronx$GROSS.SQUARE.FEET) + log(bronx$LAND.SQUARE.FEET) + factor(bronx$NEIGHBORHOOD) * factor(bronx$BUILDING.CLASS.CATEGORY), data = bronx)
> summary(m4)
> plot(resid(m4))
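
Models 3 and 4 differ only in `+` versus `*` between the two factors. A short sketch of how R expands the two formulas (the variable names y, x, a, b are placeholders, not columns of the bronx data):

```r
# Additive formula: main effects only
f_add <- y ~ x + a + b
# Crossed formula: a * b expands to a + b + a:b
f_int <- y ~ x + a * b

attr(terms(f_add), "term.labels")   # "x" "a" "b"
attr(terms(f_int), "term.labels")   # "x" "a" "b" "a:b"
```

With two factors that each have many levels, the a:b term adds one coefficient per combination, and combinations that never occur in the data typically show up as NA in the summary.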


And now… a complex example
> install.packages("geoPlot")
> install.packages("xlsx")
> require(class)
> require(gdata)
> require(geoPlot)
> require(xlsx)
#if not already read-in:
bronx <- read.xlsx("sales/rollingsales_bronx.xls", pattern = "BOROUGH", stringsAsFactors = FALSE, sheetIndex = 1, startRow = 5, header = TRUE)

View(bronx)
#clean up with regular expressions
bronx$SALE.PRICE <- as.numeric(gsub("[^[:digit:]]", "", bronx$SALE.PRICE))
#missing values?
sum(is.na(bronx$SALE.PRICE))
#zero sale prices
sum(bronx$SALE.PRICE == 0)
#clean these numeric and date fields
bronx$GROSS.SQUARE.FEET <- as.numeric(gsub("[^[:digit:]]", "", bronx$GROSS.SQUARE.FEET))
bronx$LAND.SQUARE.FEET <- as.numeric(gsub("[^[:digit:]]", "", bronx$LAND.SQUARE.FEET))
bronx$SALE.DATE <- as.Date(gsub("[^[:digit:]]", "", bronx$SALE.DATE))
bronx$YEAR.BUILT <- as.numeric(gsub("[^[:digit:]]", "", bronx$YEAR.BUILT))
bronx$ZIP.CODE <- as.character(gsub("[^[:digit:]]", "", bronx$ZIP.CODE))
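
The pattern "[^[:digit:]]" matches every character that is not a digit, so gsub() with an empty replacement leaves only the digits. A quick sketch on made-up strings (the values in raw are invented for illustration):

```r
# Price-like strings with currency symbols, separators, padding and a placeholder
raw <- c("$1,250,000", "  450000 ", "$0", "-")

digits <- gsub("[^[:digit:]]", "", raw)   # strip everything except 0-9
digits                                    # "1250000" "450000" "0" ""

price <- as.numeric(digits)               # the empty string becomes NA
sum(is.na(price))                         # 1 missing value
sum(price == 0, na.rm = TRUE)             # 1 zero price
```

Because the pattern also removes decimal points and minus signs, it is only safe for fields that are meant to be non-negative whole numbers.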

More corrections
#filter out low prices
minprice <-   # threshold value missing from this transcript
bronx <- bronx[which(bronx$SALE.PRICE >= minprice), ]
#how many left?
nval <- dim(bronx)[1]
#addresses contain apartment #'s even though there is another column for that - remove them - compresses addresses
bronx$ADDRESSONLY <- gsub("[,][[:print:]]*", "", gsub("[ ]+", "", trim(bronx$ADDRESS)))
#new data frame for sorting the addresses, fixing etc.
bronxadd <- unique(data.frame(bronx$ADDRESSONLY, bronx$ZIP.CODE, stringsAsFactors = FALSE))
# fix the names
names(bronxadd) <- c("ADDRESSONLY", "ZIP.CODE")
bronxadd <- bronxadd[order(bronxadd$ADDRESSONLY), ]

Yep, more…
# duplicates?
duplicates <- duplicated(bronxadd$ADDRESSONLY)
##if(duplicates) dupadd <- bronxadd[bronxadd$duplicates, 1]
##bronxadd <- bronxadd[(bronxadd$ADDRESSONLY != dupadd[1] & bronxadd$ADDRESSONLY != dupadd[2]), ]
#how many?
nadd <- dim(bronxadd)[1]

Oh, we want nearest neighbors? How?
#problem, we need a spatial distribution since none of the columns have that
#we will use google maps so limit the number to under 500 (ask me why)
nsample = 450
addsample <- bronxadd[sample.int(nadd, size = nsample), ]
#new data frame for the full address
addrlist <- data.frame(1:nsample, addsample$ADDRESSONLY, rep("NEW YORK", times = nsample), rep("NY", times = nsample), addsample$ZIP.CODE, rep("US", times = nsample))
#look them up
querylist <- addrListLookup(addrlist)

Lots missing – why?
# how many returned valid lat/long?
querylist$matched <- (querylist$latitude != 0)
unmatchedid <- which(!querylist$matched)
#MANY missing - what's up?
unmatched <- length(unmatchedid)
#WEST -> W and EAST -> E - do again.
addrlist2 <- data.frame(1:unmatched, gsub(" WEST ", " W ", gsub(" EAST ", " E ", addsample[unmatchedid, 1])), rep("NEW YORK", times = unmatched), rep("NY", times = unmatched), addsample[unmatchedid, 2], rep("US", times = unmatched))
querylist[unmatchedid, 1:4] <- addrListLookup(addrlist2)[, 1:4]
querylist$matched <- (querylist$latitude != 0)
unmatchedid <- which(!querylist$matched)
unmatched <- length(unmatchedid)

Not enough
#this fixed a LOT but we need more: STREET and AVENUE (could have done PLACE) and others
addrlist3 <- data.frame(1:unmatched, gsub("WEST", "W", gsub("EAST", "E", gsub("STREET", "ST ", gsub("AVENUE", "AVE", addsample[unmatchedid, 1])))), rep("NEW YORK", times = unmatched), rep("NY", times = unmatched), addsample[unmatchedid, 2], rep("US", times = unmatched))
querylist[unmatchedid, 1:4] <- addrListLookup(addrlist3)[, 1:4]
querylist$matched <- (querylist$latitude != 0)
unmatchedid <- which(!querylist$matched)
unmatched <- length(unmatchedid)
# 9 left now? good enough.

Rebuild!
addsample <- cbind(addsample, querylist$latitude, querylist$longitude)
##names(addsample[3:4]) <- c("latitude","longitude") - this was meant to correct the column names but did not work for me
addsample <- addsample[addsample$'querylist$latitude' != 0, ] # note ' ' to work around column name
adduse <- merge(bronx, addsample)
# the column is still called 'querylist$latitude' here, so refer to it by that name
adduse <- adduse[!is.na(adduse$'querylist$latitude'), ]
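
On the renaming attempt that did not work: the subscript has to go outside names() for the assignment to change the data frame's names. A sketch on a toy data frame (df and its values are invented; the lab code keeps the original 'querylist$…' names, so this is only an illustration of the idiom):

```r
df <- data.frame(a = 1:2, b = 3:4)
df <- cbind(df, c(10, 20), c(30, 40))          # unnamed columns get awkward names
names(df)

# Subscript inside names(): the renamed columns are written back into
# positions 3:4, but the existing column names of df are kept
names(df[3:4]) <- c("latitude", "longitude")
names(df)

# Subscript outside names(): the names themselves are replaced
names(df)[3:4] <- c("latitude", "longitude")
names(df)
```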

Most satisfying part!
mapcoord <- adduse[, c(2, 4, 24, 25)]
table(mapcoord$NEIGHBORHOOD)
mapcoord$NEIGHBORHOOD <- as.factor(mapcoord$NEIGHBORHOOD)
#
geoPlot(mapcoord, zoom = 12, color = mapcoord$NEIGHBORHOOD)


Did you forget the KNN?
#almost there.
mapcoord$class <- as.numeric(mapcoord$NEIGHBORHOOD)
nclass <- dim(mapcoord)[1]
split <- 0.8
trainid <- sample.int(nclass, floor(split * nclass))
testid <- (1:nclass)[-trainid]
##mappred <- mapcoord[testid, ]
##mappred$class <- as.numeric(mappred$NEIGHBORHOOD)

KNN!
kmax <- 10
knnpred <- matrix(NA, ncol = kmax, nrow = length(testid))
knntesterr <- rep(NA, times = kmax)
for (i in 1:kmax) { # loop over k
  knnpred[, i] <- knn(mapcoord[trainid, 3:4], mapcoord[testid, 3:4], cl = mapcoord[trainid, 2], k = i)
  knntesterr[i] <- sum(knnpred[, i] != mapcoord[testid, 2]) / length(testid)
}
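
The loop over k can be tried without the geocoded addresses. The sketch below assumes the built-in iris data as a stand-in and keeps the predictions as a factor, which avoids comparing integer codes with factor labels; the seed and the 80/20 split are arbitrary.

```r
library(class)  # provides knn()

set.seed(3)                                   # arbitrary seed for a reproducible split
n <- nrow(iris)
train_id <- sample.int(n, floor(0.8 * n))
test_id  <- (1:n)[-train_id]

kmax <- 10
err <- rep(NA_real_, kmax)
for (k in 1:kmax) {                           # loop over k
  pred <- knn(iris[train_id, 1:4], iris[test_id, 1:4],
              cl = iris$Species[train_id], k = k)
  err[k] <- mean(pred != iris$Species[test_id])   # test error for this k
}

plot(1:kmax, err, type = "b", xlab = "k", ylab = "test error")
which.min(err)                                # k with the lowest test error
```

With a single random split the error curve is noisy, so the best k from one run is only a rough guide.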

Finally K-Means!
> # kmeans() needs numeric input, so the ZIP code is converted
> mapmeans <- data.frame(as.numeric(adduse$ZIP.CODE), as.numeric(mapcoord$NEIGHBORHOOD), adduse$TOTAL.UNITS, adduse$"LAND.SQUARE.FEET", adduse$GROSS.SQUARE.FEET, adduse$SALE.PRICE, adduse$'querylist$latitude', adduse$'querylist$longitude')
> mapobj <- kmeans(mapmeans, 5, iter.max = 10, nstart = 5, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"))
> fitted(mapobj, method = c("centers", "classes"))
> plot(mapmeans, col = mapobj$cluster) # color the points by cluster
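
A self-contained version of the K-means step, assuming the built-in iris measurements as a stand-in for mapmeans. The columns are standardised first, which may also be worth considering for mapmeans, since sale price and latitude are on very different scales; the seed and the choice of 3 centers are arbitrary.

```r
set.seed(4)                                   # k-means starts from random centers
x <- scale(iris[, 1:4])                       # standardise so no column dominates the distances

km <- kmeans(x, centers = 3, iter.max = 10, nstart = 5)

km$size                                       # number of points in each cluster
head(fitted(km, method = "classes"))          # cluster index assigned to each row
head(fitted(km, method = "centers"))          # center of the cluster each row belongs to

table(cluster = km$cluster, species = iris$Species)   # compare clusters with known labels
plot(iris[, 1:4], col = km$cluster)           # pairs plot colored by cluster
```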


Assignment 3
Preliminary and Statistical Analysis. Due ~ Feb % (written)
–Distribution analysis and comparison, visual ‘analysis’, statistical model fitting and testing of some of the nyt1…31 datasets.

Tentative assignments
Assignment 4: Pattern, trend, relations: model development and evaluation. Due ~ early March. 15% (10% written and 5% oral; individual);
Assignment 5: Term project proposal. Due ~ week 7. 5% (0% written and 5% oral; individual);
Assignment 6: Predictive and Prescriptive Analytics. Due ~ week 8. 15% (15% written and 5% oral; individual);
Term project. Due ~ week % (25% written, 5% oral; individual).

Admin info (keep/print this slide)
Class: ITWS-4963/ITWS 6965
Hours: 12:00pm-1:50pm Tuesday/Friday
Location: SAGE 3101
Instructor: Peter Fox
Instructor contact: (do not leave a
Contact hours: Monday** 3:00-4:00pm (or by appt)
Contact location: Winslow 2120 (sometimes Lally 207A announced by )
TA: Lakshmi Chenicheri
Web site:
–Schedule, lectures, syllabus, reading, assignments, etc.

Table: Matlab/R/scipy-numpy 36