Group 1 Lab 2 exercises and Assignment 2 Peter Fox Data Analytics – ITWS-4600/ITWS-6600/MATP-4450 Group 1, Lab 2, February 1, 2018
Labs 2a. Regression 2b. kNN 2c. Kmeans Do all three for Assignment 2 New multivariate dataset 2b. kNN New Abalone dataset 2c. Kmeans (Sort of) New Iris dataset Do all three for Assignment 2 And then general exercises
The Dataset(s) http://aquarius.tw.rpi.edu/html/DA See slides: DataAnalytics2018_Assignment_2.pptx on LMS or https://tw.rpi.edu/web/Courses/DataAnalytics/2018 (under week 3) Code fragments, i.e. they will not run as-is, on the following slides as group1/lab2_knn1.R, etc.
Remember a few useful cmds head(<object>) tail(<object>) summary(<object>)
Regression Exercises Using the EPI (under /EPI on web) dataset find the single most important factor in increasing the EPI in a given region Examine distributions down to the leaf nodes and build up an EPI “model”
boxplot(ENVHEALTH,ECOSYSTEM)
qqplot(ENVHEALTH,ECOSYSTEM)
ENVHEALTH/ ECOSYSTEM > shapiro.test(ENVHEALTH) Shapiro-Wilk normality test data: ENVHEALTH W = 0.9161, p-value = 1.083e-08 ------- Reject. > shapiro.test(ECOSYSTEM) data: ECOSYSTEM W = 0.9813, p-value = 0.02654 ----- ~reject
Kolmogorov- Smirnov - KS test - > ks.test(ENVHEALTH,ECOSYSTEM) Two-sample Kolmogorov-Smirnov test data: ENVHEALTH and ECOSYSTEM D = 0.2965, p-value = 5.413e-07 alternative hypothesis: two-sided Warning message: In ks.test(ENVHEALTH, ECOSYSTEM) : p-value will be approximate in the presence of ties
Linear and least-squares > EPI_data <- read.csv(”EPI_data.csv") > attach(EPI_data); > boxplot(ENVHEALTH,DALY,AIR_H,WATER_H) > lmENVH<-lm(ENVHEALTH~DALY+AIR_H+WATER_H) > lmENVH … (what should you get?) > summary(lmENVH) … > cENVH<-coef(lmENVH)
Predict > DALYNEW<-c(seq(5,95,5)) > AIR_HNEW<-c(seq(5,95,5)) > WATER_HNEW<-c(seq(5,95,5)) > NEW<-data.frame(DALYNEW,AIR_HNEW,WATER_HNEW) > pENV<- predict(lmENVH,NEW,interval=“prediction”) > cENV<- predict(lmENVH,NEW,interval=“confidence”)
Repeat for AIR_E CLIMATE
Classification Exercises (group1/lab2_knn1.R) > nyt1<-read.csv(“nyt1.csv") > nyt1<-nyt1[which(nyt1$Impressions>0 & nyt1$Clicks>0 & nyt1$Age>0),] > nnyt1<-dim(nyt1)[1] # shrink it down! > sampling.rate=0.9 > num.test.set.labels=nnyt1*(1.-sampling.rate) > training <-sample(1:nnyt1,sampling.rate*nnyt1, replace=FALSE) > train<-subset(nyt1[training,],select=c(Age,Impressions)) > testing<-setdiff(1:nnyt1,training) > test<-subset(nyt1[testing,],select=c(Age,Impressions)) > cg<-nyt1$Gender[training] > true.labels<-nyt1$Gender[testing] > classif<-knn(train,test,cg,k=5) # > classif > attributes(.Last.value) # interpretation to come!
Classification Exercises (group1/lab2_knn2.R) 2 examples in the script
Clustering Exercises group1/lab2_kmeans1.R group1/lab2_kmeans2.R – plotting up results from the iris clustering