1
Group 1 Lab 2 exercises and Assignment 2
Peter Fox Data Analytics ITWS-4600/ITWS-6600/MATP-4450/CSCI-4960 Group 1, Lab 2, September 14, 2018
2
Labs
2a. Regression: new multivariate dataset
2b. kNN: new Abalone dataset
2c. Kmeans: (sort of) new Iris dataset
Do all three for Assignment 2
And then general exercises
3
The Dataset(s)
http://aquarius.tw.rpi.edu/html/DA
See slides: DataAnalytics2018Fall_Assignment_2.pptx on LMS or (under week 3)
The following slides show code fragments, i.e. they will not run as-is; the scripts are available as group1/lab2_knn1.R, etc.
4
Remember a few useful commands
head(<object>)
tail(<object>)
summary(<object>)
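As a quick illustration of these commands, here is a minimal sketch that uses R's built-in iris data frame as a stand-in for whichever object you are inspecting. The variable names first_rows, last_rows and col_summary are only illustrative.

```r
# iris ships with base R; it stands in for any data frame loaded in the lab.
data(iris)

first_rows  <- head(iris)       # first 6 rows by default
last_rows   <- tail(iris, 3)    # last 3 rows
col_summary <- summary(iris)    # min/quartiles/mean/max, or counts for factors

print(first_rows)
print(last_rows)
print(col_summary)
```

Running these right after read.csv() is a cheap way to catch mis-parsed columns before any modelling.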
5
Regression Exercises
Using the EPI dataset (under /EPI on the web), find the single most important factor in increasing the EPI in a given region.
Examine distributions down to the leaf nodes and build up an EPI “model”.
6
boxplot(ENVHEALTH,ECOSYSTEM)
7
qqplot(ENVHEALTH,ECOSYSTEM)
8
ENVHEALTH / ECOSYSTEM
> shapiro.test(ENVHEALTH)
Shapiro-Wilk normality test
data: ENVHEALTH
W = , p-value = 1.083e
Reject.
> shapiro.test(ECOSYSTEM)
data: ECOSYSTEM
W = , p-value =
~reject
9
Kolmogorov-Smirnov (KS) test
> ks.test(ENVHEALTH,ECOSYSTEM)
Two-sample Kolmogorov-Smirnov test
data: ENVHEALTH and ECOSYSTEM
D = , p-value = 5.413e-07
alternative hypothesis: two-sided
Warning message:
In ks.test(ENVHEALTH, ECOSYSTEM) : p-value will be approximate in the presence of ties
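To see how the two tests behave when the answer is known in advance, the sketch below runs them on simulated data rather than on the EPI columns. The vectors x and y, their distributions and the seed are assumptions made only for this illustration.

```r
set.seed(42)

# Simulated stand-ins (not EPI data): x is drawn from a normal distribution,
# y from a strongly skewed exponential distribution.
x <- rnorm(200, mean = 50, sd = 10)
y <- rexp(200, rate = 1 / 50)

# Shapiro-Wilk: the null hypothesis is that the sample is normally distributed.
sw_x <- shapiro.test(x)
sw_y <- shapiro.test(y)

# Two-sample Kolmogorov-Smirnov: the null hypothesis is that both samples
# come from the same distribution.
ks_xy <- ks.test(x, y)

print(sw_x$p.value)
print(sw_y$p.value)   # very small: reject normality for the skewed sample
print(ks_xy$p.value)  # very small: the two samples differ in distribution
```

A small p-value means "reject the null", which is how the "Reject" notes on the slides should be read.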
10
Linear and least-squares
> EPI_data <- read.csv("EPI_data.csv")
> attach(EPI_data)
> boxplot(ENVHEALTH,DALY,AIR_H,WATER_H)
> lmENVH<-lm(ENVHEALTH~DALY+AIR_H+WATER_H)
> lmENVH
… (what should you get?)
> summary(lmENVH)
…
> cENVH<-coef(lmENVH)
11
Predict
> DALYNEW<-c(seq(5,95,5))
> AIR_HNEW<-c(seq(5,95,5))
> WATER_HNEW<-c(seq(5,95,5))
> # name the columns after the model's predictors so predict() picks them up from NEW
> NEW<-data.frame(DALY=DALYNEW,AIR_H=AIR_HNEW,WATER_H=WATER_HNEW)
> pENV<- predict(lmENVH,NEW,interval="prediction")
> cENV<- predict(lmENVH,NEW,interval="confidence")
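Because the fragments above depend on EPI_data.csv, the following self-contained sketch walks through the same fit-then-predict steps on simulated data. The data frame sim, its weights and noise level are arbitrary assumptions for illustration and say nothing about the real EPI values.

```r
set.seed(1)
n <- 100

# Simulated stand-in for the EPI columns (values are made up, not real EPI data).
sim <- data.frame(
  DALY    = runif(n, 0, 100),
  AIR_H   = runif(n, 0, 100),
  WATER_H = runif(n, 0, 100)
)
# Arbitrary weights plus noise, chosen only so the fit has something to recover.
sim$ENVHEALTH <- 0.5 * sim$DALY + 0.25 * sim$AIR_H + 0.25 * sim$WATER_H +
  rnorm(n, sd = 2)

# Passing data = sim avoids attach() and keeps variable lookup explicit.
fit <- lm(ENVHEALTH ~ DALY + AIR_H + WATER_H, data = sim)
print(coef(fit))

# The columns of the new data frame must be named like the model's predictors.
new <- data.frame(
  DALY    = seq(5, 95, 5),
  AIR_H   = seq(5, 95, 5),
  WATER_H = seq(5, 95, 5)
)
p_pred <- predict(fit, new, interval = "prediction")  # interval for a new observation
p_conf <- predict(fit, new, interval = "confidence")  # interval for the mean response
print(head(p_pred))
print(head(p_conf))
```

Both calls return a matrix with columns fit, lwr and upr. The prediction interval is wider than the confidence interval because it also accounts for the scatter of individual observations.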
12
Repeat for AIR_E and CLIMATE
13
Classification Exercises (group1/lab2_knn1.R)
> library(class) # provides knn()
> nyt1<-read.csv("nyt1.csv")
> nyt1<-nyt1[which(nyt1$Impressions>0 & nyt1$Clicks>0 & nyt1$Age>0),]
> nnyt1<-dim(nyt1)[1] # shrink it down!
> sampling.rate=0.9
> num.test.set.labels=nnyt1*(1.-sampling.rate)
> training <-sample(1:nnyt1,sampling.rate*nnyt1, replace=FALSE)
> train<-subset(nyt1[training,],select=c(Age,Impressions))
> testing<-setdiff(1:nnyt1,training)
> test<-subset(nyt1[testing,],select=c(Age,Impressions))
> cg<-nyt1$Gender[training]
> true.labels<-nyt1$Gender[testing]
> classif<-knn(train,test,cg,k=5)
> classif
> attributes(.Last.value) # interpretation to come!
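The fragment stops once classif is computed. One possible way to interpret the result is to compare it with true.labels, sketched below on the built-in iris data so that it runs without nyt1.csv. The choice of iris, the two petal features and the names confusion and misclass.rate are assumptions for this illustration, not the contents of lab2_knn1.R.

```r
library(class)  # knn() lives in the 'class' package shipped with R

set.seed(123)
n <- nrow(iris)
sampling.rate <- 0.9

# Same split logic as the slide: 90% of the row indices for training, the rest for testing.
training <- sample(1:n, round(sampling.rate * n), replace = FALSE)
testing  <- setdiff(1:n, training)

train <- iris[training, c("Petal.Length", "Petal.Width")]
test  <- iris[testing,  c("Petal.Length", "Petal.Width")]
cg          <- iris$Species[training]  # class labels for the training rows
true.labels <- iris$Species[testing]   # held-out labels used only for checking

classif <- knn(train, test, cg, k = 5)

# Confusion table and misclassification rate on the held-out rows.
confusion     <- table(predicted = classif, actual = true.labels)
misclass.rate <- mean(classif != true.labels)
print(confusion)
print(misclass.rate)
```

Re-running with different values of k and comparing misclass.rate is a simple way to choose k.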
14
Classification Exercises (group1/lab2_knn2.R)
2 examples in the script
15
Clustering Exercises group1/lab2_kmeans1.R
group1/lab2_kmeans2.R – plotting up results from the iris clustering
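The scripts themselves are not reproduced on the slides, so the following is only a minimal sketch of k-means clustering on the built-in iris data with a simple plot of the result. The choice of three centers, nstart = 20 and the plotted petal columns are assumptions, not the contents of lab2_kmeans1.R or lab2_kmeans2.R.

```r
set.seed(101)

# Cluster on the four numeric measurements; the Species column is left out.
features <- iris[, 1:4]
km <- kmeans(features, centers = 3, nstart = 20)

# Cross-tabulate the clusters against the known species to see how well they line up.
agreement <- table(cluster = km$cluster, species = iris$Species)
print(agreement)

# Plot two of the features, coloured by assigned cluster, with the centers marked.
plot(iris$Petal.Length, iris$Petal.Width,
     col = km$cluster, pch = 19,
     xlab = "Petal.Length", ylab = "Petal.Width",
     main = "k-means clusters on iris (k = 3)")
points(km$centers[, c("Petal.Length", "Petal.Width")], pch = 8, cex = 2)
```

Cluster numbers are arbitrary labels, so read the table by looking at which cluster captures most of each species rather than expecting cluster 1 to match the first species.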