Group 1 Lab 2 exercises and Assignment 2

Peter Fox Data Analytics ITWS-4600/ITWS-6600/MATP-4450/CSCI-4960 Group 1, Lab 2, September 14, 2018

Labs
2a. Regression: new multivariate dataset
2b. kNN: new Abalone dataset
2c. Kmeans: (sort of) new Iris dataset
Do all three for Assignment 2, and then the general exercises

The Dataset(s) http://aquarius.tw.rpi.edu/html/DA
See slides: DataAnalytics2018Fall_Assignment_2.pptx on LMS or (under week 3)
Code fragments (i.e. they will not run as-is) are on the following slides and are available as group1/lab2_knn1.R, etc.

Remember a few useful commands
head(<object>)
tail(<object>)
summary(<object>)
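As a quick refresher, the three commands can be tried on any data frame. The sketch below uses R's built-in iris data purely as a stand-in (an assumption for illustration; the lab datasets are loaded separately), and the variable names are hypothetical.

```r
# Built-in iris data stands in for any data frame (illustrative assumption).
data(iris)

first_rows <- head(iris)         # first 6 rows by default
last_rows  <- tail(iris, n = 3)  # last 3 rows
overview   <- summary(iris)      # per-column summaries (quartiles, counts)

print(first_rows)
print(last_rows)
print(overview)
```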

Regression Exercises
Using the EPI dataset (under /EPI on web), find the single most important factor in increasing the EPI in a given region
Examine distributions down to the leaf nodes and build up an EPI “model”

boxplot(ENVHEALTH,ECOSYSTEM)

qqplot(ENVHEALTH,ECOSYSTEM)

ENVHEALTH / ECOSYSTEM
> shapiro.test(ENVHEALTH)
Shapiro-Wilk normality test
data: ENVHEALTH
W = , p-value = 1.083e
Reject.
> shapiro.test(ECOSYSTEM)
data: ECOSYSTEM
W = , p-value =
~reject
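To make the "reject" decision concrete, here is a minimal sketch on simulated data rather than the EPI variables. The sample sizes, distributions, and the 0.05 threshold are assumptions chosen for illustration. A small p-value means the hypothesis of normality is rejected.

```r
set.seed(42)

# Simulated stand-ins (assumptions): one roughly normal sample, one skewed.
normal_sample <- rnorm(100, mean = 50, sd = 10)
skewed_sample <- rexp(100, rate = 1 / 50)

sw_normal <- shapiro.test(normal_sample)
sw_skewed <- shapiro.test(skewed_sample)

alpha <- 0.05  # conventional significance level, assumed here
reject_skewed <- sw_skewed$p.value < alpha  # TRUE means "reject normality"

print(sw_normal)
print(sw_skewed)
```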

Kolmogorov-Smirnov (KS) test
> ks.test(ENVHEALTH,ECOSYSTEM)
Two-sample Kolmogorov-Smirnov test
data: ENVHEALTH and ECOSYSTEM
D = , p-value = 5.413e-07
alternative hypothesis: two-sided
Warning message:
In ks.test(ENVHEALTH, ECOSYSTEM) : p-value will be approximate in the presence of ties
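The two-sample KS test asks whether two samples plausibly come from the same distribution. A minimal sketch on simulated data follows; the two normal samples and their parameters are assumptions for illustration. Because simulated continuous values have no ties, the ties warning seen with the EPI variables should not appear here.

```r
set.seed(1)

# Two simulated samples with clearly different means (illustrative assumption).
a <- rnorm(200, mean = 0, sd = 1)
b <- rnorm(200, mean = 2, sd = 1)

ks_result <- ks.test(a, b)
print(ks_result)

# D is the largest gap between the two empirical CDFs;
# a small p-value suggests the samples differ in distribution.
d_stat  <- ks_result$statistic
p_value <- ks_result$p.value
```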

Linear and least-squares
> EPI_data <- read.csv("EPI_data.csv")
> attach(EPI_data)
> boxplot(ENVHEALTH,DALY,AIR_H,WATER_H)
> lmENVH<-lm(ENVHEALTH~DALY+AIR_H+WATER_H)
> lmENVH
… (what should you get?)
> summary(lmENVH)
…
> cENVH<-coef(lmENVH)
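The same fit-inspect-extract pattern can be tried without the EPI file. The sketch below uses R's built-in mtcars data as a stand-in (an assumption; the response and predictors are chosen only for illustration) and passes the data frame through `data =` instead of `attach()`.

```r
data(mtcars)

# Multiple linear regression by least squares: mpg explained by weight and horsepower.
fit <- lm(mpg ~ wt + hp, data = mtcars)

print(fit)           # coefficients only
print(summary(fit))  # standard errors, t-tests, R-squared

coefs <- coef(fit)                 # named vector: intercept, wt, hp
r2    <- summary(fit)$r.squared    # fraction of variance explained
```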

Predict
> DALYNEW<-c(seq(5,95,5))
> AIR_HNEW<-c(seq(5,95,5))
> WATER_HNEW<-c(seq(5,95,5))
> # column names must match the model's predictors for predict() to use them
> NEW<-data.frame(DALY=DALYNEW,AIR_H=AIR_HNEW,WATER_H=WATER_HNEW)
> pENV<- predict(lmENVH,NEW,interval="prediction")
> cENV<- predict(lmENVH,NEW,interval="confidence")
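A minimal sketch of the two interval types, again on the built-in mtcars data as a stand-in (an assumption; the grid of new values is hypothetical). The new data frame uses the same column names as the model's predictors. A prediction interval covers a single new observation, so it is wider than the confidence interval for the mean response.

```r
data(mtcars)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# New predictor values; column names match the predictors in the formula.
new_cars <- data.frame(wt = seq(2, 5, by = 0.5),
                       hp = seq(80, 200, by = 20))

pred_int <- predict(fit, new_cars, interval = "prediction")
conf_int <- predict(fit, new_cars, interval = "confidence")

# Each result is a matrix with columns fit, lwr, upr.
width_pred <- pred_int[, "upr"] - pred_int[, "lwr"]
width_conf <- conf_int[, "upr"] - conf_int[, "lwr"]

print(cbind(pred_int, width_pred))
print(cbind(conf_int, width_conf))
```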

Repeat for AIR_E and CLIMATE

Classification Exercises (group1/lab2_knn1.R)
> library(class) # provides knn()
> nyt1<-read.csv("nyt1.csv")
> nyt1<-nyt1[which(nyt1$Impressions>0 & nyt1$Clicks>0 & nyt1$Age>0),]
> nnyt1<-dim(nyt1)[1] # shrink it down!
> sampling.rate=0.9
> num.test.set.labels=nnyt1*(1.-sampling.rate)
> training <-sample(1:nnyt1,sampling.rate*nnyt1, replace=FALSE)
> train<-subset(nyt1[training,],select=c(Age,Impressions))
> testing<-setdiff(1:nnyt1,training)
> test<-subset(nyt1[testing,],select=c(Age,Impressions))
> cg<-nyt1$Gender[training]
> true.labels<-nyt1$Gender[testing]
> classif<-knn(train,test,cg,k=5)
> classif
> attributes(.Last.value) # interpretation to come!
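The same train/test split and kNN call can be tried end to end without nyt1.csv. The sketch below substitutes the built-in iris data (an assumption; the two feature columns, the seed, and k = 5 are illustrative choices) and adds a comparison of predicted and true labels, which is one way to start the interpretation.

```r
library(class)  # knn() lives in the class package, shipped with R

set.seed(123)
data(iris)

n <- nrow(iris)
sampling_rate <- 0.9
train_idx <- sample(seq_len(n), size = round(sampling_rate * n))
test_idx  <- setdiff(seq_len(n), train_idx)

features <- c("Sepal.Length", "Petal.Length")  # two numeric predictors
train_x <- iris[train_idx, features]
test_x  <- iris[test_idx, features]
train_labels <- iris$Species[train_idx]
true_labels  <- iris$Species[test_idx]

# Classify each test row by majority vote among its 5 nearest training rows.
predicted <- knn(train_x, test_x, cl = train_labels, k = 5)

confusion <- table(predicted, true_labels)
accuracy  <- mean(predicted == true_labels)
print(confusion)
print(accuracy)
```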

Classification Exercises (group1/lab2_knn2.R)
2 examples in the script

Clustering Exercises
group1/lab2_kmeans1.R
group1/lab2_kmeans2.R – plotting up results from the iris clustering
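For orientation, a minimal k-means sketch on the built-in iris data follows. It is not the content of the lab scripts; the choice of 3 centers, the seed, and `nstart = 20` are assumptions for illustration. Cross-tabulating cluster assignments against the known species shows how well the unsupervised clusters line up with the labels.

```r
set.seed(7)
data(iris)

# Cluster on the four numeric measurements only; Species is held back.
measurements <- iris[, 1:4]
km <- kmeans(measurements, centers = 3, nstart = 20)

# Rows: cluster id (arbitrary numbering); columns: true species.
comparison <- table(cluster = km$cluster, species = iris$Species)
print(comparison)
print(km$centers)  # one row of mean measurements per cluster
```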
